Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
General comments / questions (see elsewhere in this review for contributions/originality/significance): - The paper is generally well-written and clear given the (at times) technical content. - The approach appears technical sound and I find the insight presented in section 3.1 (and proof in the supplementary) together with resulting algorithm quite interesting (and possibly useful in other domains). I’ve read through and haven’t detected any obvious flaws but I’d suggest to double check the technical proofs and assumptions (especially in the appendix). - Sec 2 seems a bit long compared to the new insight presented in Sec 3. - The experiments appear relevant and mostly well-executed. My main concern is the perceived benefit of the two-step approach. It is especially difficult for a reader to quickly see the clear benefit of the 2-OPT (or any variant of it) when looking at the “real-world” HPOlib and ATO benchmarks. It would perhaps be relevant to further focus on and emphasise the robustness of the 2-step approach over alternatives. Also, given that a primary goal of the paper is to improve time complexity, not introduce a new acquisition function per se, I would perhaps have expected an more detailed empirical comparison of the time complexity among different multi-step (at least two) BO approaches (especially GLASSES) (possibly in the appendix). - The variance on the optimization traces for EI and LCB are often very small (despite a very non-smooth traces for e.g. 5d Ackley fig 2) or non-existent; any reason or insight into this? Minor: - L 143: Define Q (I assume it is a typo or last-minute change in notation). - L 233-244; please check for minor typos; esp. l 238 - Figures are bit hard to read due to axis labels and markers etc being rather small; overlapping graphs etc.
Summary: This paper proposes a new acquisition function for Bayesian optimization that considers not only one step in the future, but two. This is expected to be more accurate. However, taking into account that there are 2 evaluations remaining is a challenging task. The required computations are intractable and approximations need to be made. The authors suggest using a Monte Carlo approximation of the objective, which is then combined with stochastic gradient optimization. Some theoretical results about the convergence of the proposed approach are given. The method is also evaluated and compared with other non-myopic approaches for BO alongside with standard myopic approaches. These experiments show some gains. Comments: I believe that the problem addressed by the paper is interesting and relevant for the community. The problem is, however, that the proposed approach seems to be a slight variation of the already known methods. In particular, the only contribution seems to be the optimization of the acquisition function which is done using stochastic gradient algorithms. This seems to be a rather weak contribution. The experimental setting seems to be limited in the sense that only a few real datasets are provided, and in those (HPOlib) the proposed approach does not seem to give significantly better results. The authors only compare with related methods only on a set of synthetic problems extracted from another reference. I wonder why the did not compare on each problem with those techniques. The authors also claim that the proposed approach is faster than competing methods. However, no experiments showing this are provided. The authors also say that their approach supports using a batch evaluation setting for the first step. However, in the experiments, it is not clear what batch size they use. There are also some typos in the paper: E.g. line 110, line 125, line 143 (what is Q?), line 154 (what is EI_2?), and line 152 (K(X_1,x) is missing a subindex). From my point of view the paper rather weak and marginally below the acceptance threshold. The reproducibility checklist is wrongly completed by the authors. Summing up, I believe that this paper needs more work. In particular, the experimental section needs to be improved and to actually show the benefits of the proposed approach.
Originality: This problem has been studied before. However, this paper proposes a method for estimating the gradients of the acquisition functions. Quality: The paper has some theoretical analysis. However, more experimental results are required. Clarity: The paper can be understood. However, some of the text requires rewriting to make it easier to understand. Significance: With more empirical results, more researchers can use the proposed method.