NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID 3368: Convergence Guarantees for Adaptive Bayesian Quadrature Methods

### Reviewer 1

#### After rebuttal

Thank you very much for the response. I will try to further elaborate my comments below.

**Proposition 4.2.** This result, and not Lemma 3.2, is actually the key to unlocking the contributions of the paper; apologies if my review was not clear on this. I have checked the updated proof and your arguments are mathematically correct. However, I still have problems with the constants. A minor issue is that for n >= n_0 and cases with c_5 > 1 the bound is not tight. A more problematic part is the case with n < n_0 and c_4 > 1. Observe that c_4 = c_3 exp((c_1 / c_2) n_0^{1/d}), so in the latter case k_{X_n}(x, x) <= c_3 exp((c_1 / c_2)(n_0^{1/d} - n^{1/d})) with n_0^{1/d} - n^{1/d} > 0, which implies exp((c_1 / c_2)(n_0^{1/d} - n^{1/d})) > 1. The bound is then worse than the trivial bound c_3, which holds for all x irrespective of n. Thus, the updated bound does not really guarantee decay of exp(-n^{1/d}) for n < n_0. In my understanding, the result only guarantees _asymptotic_ decay of exp(-n^{1/d}).

**Lemma 3.2.** In my view, this result follows directly from Lemma 3.1. Observe that Lemma 3.1 tells you that k_{X_l}(x, x) is the squared distance from the evaluation functional k(x, ·) in the RKHS to its projection onto the span of the evaluation functionals centered at the instances from X_l (I am omitting q(x) because it can be treated as an importance weight and does not change anything). If all the evaluation functionals form a linearly independent set of vectors, the kernel matrix is non-singular (n instances with n >= l). If the set is linearly dependent, then some evaluation functional (i.e., vector) lies in the span of the others, and for that particular vector the distance to its projection onto the span is zero. So, Lemma 3.2 simply says that the set of evaluation functionals is either linearly independent or linearly dependent.

#### Review

The paper studies convergence properties of adaptive Bayesian quadrature methods.
This class of quadrature methods deals with an integrand that is known beforehand to satisfy a certain constraint, such as being positive. The main idea is to introduce a link function and apply it to the output of a latent function so that the composition of the two satisfies the desired constraint. The latent function is modeled with a Gaussian process, and the so-called design points are selected deterministically by optimizing an acquisition function. The paper considers a generic acquisition function that covers several of the previously proposed strategies for selecting design points in adaptive Bayesian quadrature.

The first theoretical result provides an upper bound on the approximation error of adaptive Bayesian quadrature methods in terms of the posterior variance, under fairly standard assumptions (Proposition 2.1). Following this, the posterior-variance bound is related to [1], where a weak greedy algorithm with an asymptotic convergence guarantee is considered. The greedy algorithm from [1], restricted to a Hilbert space, approximates the inner product on that space using the inner product defined on a subspace spanned by some landmarks. The landmarks are selected iteratively so that each is at least a \gamma-fraction of the optimal choice. The quality of the approximation is measured by projecting onto the subspace spanned by the landmarks and taking the distance from the span to the point of the original space that lies furthest from it. The link from Proposition 2.1 to the weak greedy algorithm of [1] hinges on Lemmas 3.1 and 3.2. These two lemmas are fairly well-known results and follow directly from [2, 3] (the results can also be found in some derivations of the Nyström method). In my opinion, these two lemmas are not entirely novel and should be credited to previous work.
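The geometric reading of Lemma 3.1 behind this criticism can be sketched in a few lines (the Gaussian kernel and design points below are arbitrary illustration choices, not the paper's): the posterior variance k_{X_l}(x, x) equals the squared residual of projecting k(x, ·) onto the span of the selected evaluation functionals, and it vanishes exactly when k(x, ·) lies in that span, which is the Lemma 3.2 dichotomy.

```python
import numpy as np

def k(a, b, ell=0.5):
    """Gaussian kernel; any strictly positive definite kernel works here."""
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * ell ** 2))

X = np.array([0.0, 0.4, 1.0])   # design points X_l (distinct points)
K = k(X, X)                     # kernel matrix, non-singular here

def post_var(x):
    """Posterior variance k_{X_l}(x, x) = k(x, x) - k(x, X) K^{-1} k(X, x),
    i.e. the squared distance from k(x, .) to its projection onto
    span{k(x_1, .), ..., k(x_l, .)} (Lemma 3.1, with q(x) omitted)."""
    b = k(x, X)
    return float(k(x, x) - b @ np.linalg.solve(K, b))

# k(0.4, .) is one of the spanning functionals, so the residual is zero
# (up to floating point); any other point leaves a positive residual.
print(post_var(0.4))
print(post_var(0.7))
```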
Theorem 3.3 gives an asymptotic convergence guarantee for adaptive Bayesian quadrature methods. For the purpose of deriving convergence rates, the paper introduces the notion of Kolmogorov n-width (Section 4). In essence, it is the worst-case approximation error for the best possible choice of the landmarks. Denoting the Kolmogorov n-width by d_n and the weak greedy approximation error by e_n, the inequality d_n <= e_n can be established. Given that d_n <= e_n, it is not entirely clear from the paper itself how an upper bound on d_n implies an upper bound on e_n (Lemma 4.1). It would be nice to clarify this claim in any future version of the paper (e.g., by appending part (i) of [1, Corollary 3.3]). Section 4.1 goes beyond asymptotic convergence and provides explicit convergence rates for adaptive Bayesian quadrature methods. I have some concerns regarding the proofs from this section and will try to clarify them below.

**Section 4.1 (Convergence rates) & Appendix C.** I fail to see why the proof of Proposition 4.2 holds and will try to clarify my concerns (Appendix C, lines 480--488):

i) k_{X_n}(x, x) <= exp(-c_1 / h_{X_n, \Omega}) holds for some constant c_1 and sufficiently small h_{X_n, \Omega};

ii) h_{X_n, \Omega} = c_2 n^{-1/d} holds for some constant c_2 when {x_i} are selected as equally-spaced grid points of \Omega;

iii) (i) and (ii) imply k_{X_n}(x, x) <= exp(-(c_1 / c_2) n^{1/d}) for large enough n;

iv) there exists a constant c_3 such that k_{X_n}(x, x) <= c_3 for all n and all x \in \Omega;

----- ??? -----

v) k_{X_n}(x, x) <= c_3 exp(-(c_1 / c_2) n^{1/d}) for all n and all x, subject to {x_i} being equally spaced.

----- ??? -----

I can follow the reasoning in (i)--(iii) and have no problems with any of the claims in (i)--(iv). However, I cannot see why the upper bound in (v) would hold for all n.
In particular, (iii) holds only for large enough n, which can be so large that one would never select that many points in adaptive Bayesian quadrature. The claim in (iv) holds for all n and all x, but as n gets larger the term exp(-(c_1 / c_2) n^{1/d}) << 1, so for some n the right-hand side of (v) drops below the trivial bound from (iv) and is no longer justified by it; such n might also not be sufficiently large for (iii) to hold. This is an important result, and a failure to establish the rate in Proposition 4.2 implies that Theorem 4.3 does not hold either. This further implies that Corollary 4.4 does not hold and there are no explicit convergence rates for adaptive Bayesian quadrature methods. Please provide clarification for this in the rebuttal, as a mistake would take away the claimed contribution. I think there is a similar problem in Appendix C.4.1 with the polynomial decay rates.

References:

[1] DeVore, Petrova, Wojtaszczyk. Greedy algorithms for reduced bases in Banach spaces. Constructive Approximation, 2013.

[2] Smola, Schölkopf. Sparse greedy matrix approximation for machine learning. ICML 2000.

[3] Schölkopf, Smola. Learning with Kernels (Section 10.2, Sparse greedy matrix approximation). MIT Press, 2002.
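The constant issue raised in the post-rebuttal comments can be checked numerically. In this sketch, every constant (c_1/c_2, c_3, d, n_0) is made up for illustration; the point is only that the combined bound c_4 exp(-(c_1/c_2) n^{1/d}) with c_4 = c_3 exp((c_1/c_2) n_0^{1/d}) is vacuous (worse than c_3) for every n < n_0 and only decays past n_0.

```python
import math

# Hypothetical constants for illustration only (none come from the paper):
c1_over_c2, c3, d, n0 = 1.0, 2.0, 2, 100

# The constant in the updated proof:
c4 = c3 * math.exp(c1_over_c2 * n0 ** (1 / d))

def combined_bound(n):
    """c_4 * exp(-(c_1/c_2) * n^{1/d})."""
    return c4 * math.exp(-c1_over_c2 * n ** (1 / d))

# Below n_0 the combined bound exceeds the trivial bound c_3, so it
# carries no information in the pre-asymptotic regime:
print(combined_bound(25), c3)   # 2*e^5 ~ 296.8 vs 2.0
# Genuine exponential decay only appears for n well beyond n_0:
print(combined_bound(400))      # 2*e^{-10} ~ 9.1e-5
```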

### Reviewer 2

**Summary:** The paper's main contribution, in my opinion, is to connect a rather wide class of adaptive Bayesian quadrature algorithms to the general class of weak greedy algorithms in an RKHS, enabling the authors to analyze the convergence of these algorithms as special cases of existing theory. The paper is rather dense, and I wish the authors provided more intuition about the role and interpretation of the various terms in Eqn. (4). By contrast, I think the connection to weak greedy algorithms is presented more clearly and is, in my opinion, generally interesting, even for those not interested in convergence rates but simply seeking a different perspective on this class of algorithms. I am not aware of any published work studying convergence rates for this wide class of algorithms, and while I can't confidently comment on how tight or 'surprising' the results are, I think it is an important first contribution in this domain.

**Detailed comments:**

Line 120: I am puzzled as to why this estimator is used for the integral. For each value of the function g, there is an estimator for the integral; one then has to take an expectation over the posterior over g. Here, it seems that we are taking the expectation over g first (to obtain \mu_{g,X_n}) and then plugging the mean into the estimator. By Jensen's inequality, depending on the convexity of T, this is likely to under- or overestimate the posterior mean. I briefly looked at the work referenced here, but even there, Table 1 suggests a different estimator should be used. Can the authors please comment on this, and perhaps expand on it in the paper?

Section 2.2: A generic form of acquisition functions is given here, but it is unclear whether this form and its various components have an intuitive interpretation. While it is possible to check that all the referenced acquisition functions do indeed fit this general formulation, it feels very generic, and it is unclear why this is the right level of abstraction.
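To make the Line-120 concern concrete: with a convex link, applying T to the posterior mean of g systematically underestimates the posterior mean of the integral of T(g). This toy sketch assumes an exponential link, i.i.d. Gaussian posterior samples, and uniform quadrature weights, none of which come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: posterior samples of the latent g at 50 quadrature nodes.
g = rng.normal(0.0, 1.0, size=(10_000, 50))
w = np.full(50, 1 / 50)   # quadrature weights
T = np.exp                # a convex link (e.g. for positive integrands)

# Plug-in estimator: T applied to the posterior mean of g, then integrated.
plug_in = T(g.mean(axis=0)) @ w
# Fully Bayesian estimator: integrate T(g) per posterior sample, then average.
bayes = np.mean(T(g) @ w)

# Jensen: E[T(g)] >= T(E[g]) for convex T, so plug_in < bayes here.
print(plug_in, bayes)   # ~1.0 vs ~1.65 (= e^{1/2})
```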
I would have appreciated a discussion of the role the various components in Eqn (4) play. Similarly, it is unclear to me whether the authors think that an arbitrary special case of (4) has a clear interpretation as ABQ.

Lines 160-167: I do wonder whether this compressed generic description of RKHS basics is useful for the reader here; I think it could be sacrificed in favour of a more intuitive explanation of the results.

Section 3: I liked this relatively compact but informative overview of gamma-weak greedy algorithms.
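The gamma-weak greedy selection reviewed in Section 3 can be sketched in a few lines (the Gaussian kernel, candidate grid, and gamma below are made-up illustration choices): at every step, any point whose distance to the current span is at least a gamma-fraction of the maximal distance may be accepted, and the worst-case error e_n still decreases.

```python
import numpy as np

def k(a, b, ell=0.3):
    """Gaussian kernel, standing in for the RKHS inner products."""
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * ell ** 2))

cand = np.linspace(0.0, 1.0, 200)   # candidate set standing in for Omega
gamma = 0.5                         # "weakness" parameter in (0, 1]

def sq_residuals(sel):
    """Squared distance of each k(x, .) to the span of the selected ones."""
    d = np.ones_like(cand)          # k(x, x) = 1 for the Gaussian kernel
    if sel:
        S = np.array(sel)
        Ks = k(S, S) + 1e-10 * np.eye(len(sel))   # jitter for stability
        B = k(cand, S)
        d = d - np.einsum('ij,ji->i', B, np.linalg.solve(Ks, B.T))
    return np.maximum(d, 0.0)       # clip tiny negative round-off

selected, e = [], []
for _ in range(8):
    r = sq_residuals(selected)
    e.append(np.sqrt(r.max()))      # e_n: worst-case approximation error
    # gamma-weak step: accept any point within a gamma-fraction of optimal;
    # here, the first candidate whose distance clears the threshold.
    idx = int(np.argmax(np.sqrt(r) >= gamma * np.sqrt(r.max())))
    selected.append(cand[idx])

print(np.round(e, 4))   # non-increasing sequence of worst-case errors
```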

### Reviewer 3

**Summary:** The authors analyse adaptive Bayesian quadrature in a general form. By relating the algorithm to weak greedy set-approximation algorithms in an RKHS, they prove consistency and provide high-level convergence rates for the estimated integrals. An excellent paper: a highly relevant and worthwhile problem, a neat solution, very nice to read, theoretically thorough. I enjoyed reading it and have very few points of critique.

**Significance:** Adaptive Bayesian quadrature methods have been shown to work very well in practice, with many published methods. Phrasing those within a unified framework, and proving consistency and its conditions, is therefore a major step forward in this field.

**Clarity:** The paper is very well written, with a great introduction and conclusions. Despite the very technical nature, the series of arguments that lead to the final rates is reasonably easy to follow. Moving all proofs to the Appendix is obviously required due to space constraints, although I liked the geometric interpretation in the proof of Lemma 3.1 and think it might be a nice fit for the main text; I am not sure it is possible, though. If anything, the conclusions and intro are slightly redundant and could potentially be shortened in favour of the figure. However, since the paper is incomplete without the proofs anyway, maybe that is not worth it.

**The relation to weak greedy algorithms:** My compliments to the authors for realising and establishing the connection -- very neat!

**Simulations:** The paper does not contain any simulations, but I think that is justified.

**Bayesian quadrature vs Monte Carlo:** I am somewhat sympathetic to the authors promoting BQ over MC methods in the intro, since this is a BQ paper. However, I think the presentation is slightly one-sided: while BQ methods certainly have their merits for certain types of problems, they are definitely not a silver bullet.
One obvious (and probably often invoked) comment here is that the convergence rate when estimating integrals with MC is independent of the dimension, whereas the BQ rates suffer quite a bit. Of course, all this depends on the problem at hand and the context, but my point is simply that MC, especially when combined with MCMC, has been a hugely successful technique for a reason -- and it is unlikely to be fully replaced by BQ methods. The cited example is not representative whatsoever, but rather cherry-picking. I feel the general tone should be rephrased to be more neutral, highlighting the pros and cons of each side.

**Ergodicity / detailed balance / weak adaptivity:** The abstract, intro, and conclusions use these terms in quite a bold fashion. However, the main text and analysis fall short of even mentioning them, not to speak of any formal connection. I think this is bad practice (over-selling) and suggest a more minimal comment in a discussion section instead. Minor: I think an intuitive discussion of the weak adaptivity condition would be helpful immediately after stating it. Currently, this is done in the conclusion (line 316), which might be a bit late.

**Supplementary material / proofs:** The main text has a very nice flow of arguments, while the Appendix is slightly less so. I was able to more or less follow most proofs in the Appendix, but it would be helpful if the authors provided some high-level overview or map of the different components, how they interact, and what techniques are used within. This would help the reader substantially, in my opinion.

#### After rebuttal

Thanks to the authors for commenting on my points of critique in the rebuttal. I am happy with the response and the planned edits. During the reviewer discussion, some points of critique on the proof (and the updated version from the rebuttal) were raised, namely that the provided rate of decay only holds asymptotically (see comments of R1).
It might be possible to fix that, and we encourage the authors either to do so or to adjust their discussion of the rate. In the discussion, we agreed that the consistency result is nevertheless a major step forward.
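The dimension-independence of the plain Monte Carlo rate invoked in the BQ-vs-MC discussion can be checked with a quick sketch (the product-form test integrand is made up; its exact integral over [0,1]^d is 1 in every dimension):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_rmse(d, n, reps=200):
    """RMSE of plain Monte Carlo with n samples on [0,1]^d for a toy
    integrand whose true integral is exactly 1."""
    f = lambda x: np.prod(1.0 + 0.1 * (x - 0.5), axis=-1)
    est = np.array([f(rng.random((n, d))).mean() for _ in range(reps)])
    return np.sqrt(np.mean((est - 1.0) ** 2))

# Quadrupling n roughly halves the RMSE (rate n^{-1/2}) in any dimension:
ratios = {d: mc_rmse(d, 250) / mc_rmse(d, 1000) for d in (1, 10, 100)}
print(ratios)   # each ratio close to 2, independent of d
```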