Review for NeurIPS paper: Federated Bayesian Optimization via Thompson Sampling

NeurIPS 2020

Federated Bayesian Optimization via Thompson Sampling

Review 1

Summary and Contributions: * The paper studies collaborative Bayesian Optimization in a Federated Learning setting. * The core idea is to utilize random Fourier feature approximations of GPs to avoid exchanging raw observations * The paper derives regret bounds for the approach, and compares its empirical performance to that of some (not really appropriate) alternatives in simulations.

Strengths: * This paper provides an interesting view of Bayesian Optimization in the context of Federated Learning, and seems practically relevant (though I note my limited exposure in this domain). * Using RFF GPs is an elegant way to ensure data is not shared between agents * The theoretical contribution of the regret bound appears sound, albeit not particularly surprising (see below).

Weaknesses: * The sublinear asymptotic regret bound is not particularly interesting, as it hinges on the sequence p_t converging to 1, meaning that information contributed by other agents will asymptotically be washed out by purely the agent's own information. So without the specifics of the bound, this is very intuitive, and since BO is a global optimization methodology there isn't concern that the agent will get stuck in some minimum that would prevent it from recovering from being mislead early on. However, the specific form of the bound does provide some insight into the dependency on the number of features. * The comparison with RGPE and TAF do not really seem appropriate, as those methods were never developed with the setting considered in this paper in mind. If these are the only points of comparison it's fine to have them in the paper, but the authors should state clearly that the provided comparisons are really against non-competing methods. * The authors only consider RFF, wheres it seems that a number of other models (stochastic variational GPs, inducing point methods) could be used as alternative ways of avoiding to share raw observations between agents.

Correctness: * The theoretical contribution of the regret bound appears sound * The empirical results are presented well, though further details on the simulation environment would be helpful to better understand the results. * The authors state that "when x_t is selected using \omega_n from an agent, FTS only needs to solve a linear optimization problem of dimension M" - This seems wrong to me, as the cosine non-linearities in the FFs would imply that this is a (hard) nonlinear optimization problem.

Clarity: * The paper is quite well written, and is relatively easy to follow. * I feel that the authors should provide more intuition for why FTS performs better than TS. Essentially, my understanding here is that each agent benefits (albeit implicitly) from having (N+1) times as many observations. This should be discussed in the paper. Also, the results will depend heavily on simulation environment, which should be discussed in more detail.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: - the authors state that "the Gaussian process (GP) model, which is the most commonly used surrogate model in BO since it is required for deriving the theoretical convergence guarantee". This statement doesn't really make sense - theoretical convergence guarantees and practical performance are very different things. - regarding the statement of the update of the computational cost: while optimizing given \omega_n is cheap, sampling \omega_n for the other agent is not (requires drawing a sample from a MVN gaussian posterior, which essentially amounts to computing a root decomposition of that agent's Sigma_t^{-1}, which grows linearly in t) - while the authors mention privacy-preserving sharing of data (e.g. differential privacy), I feel the paper could benefit by further exploring these thoughts in some more detail. It seems like the data in this setting is small enough that that may well be feasible (compared to DNN-based FL approaches) - what is the horizontal line in Fig 2? - the landmine example seems very out of place: Why would anyone want to run a FL algorithm for landmine detection that avoids sharing the classifiers between landmine fields? - the paper makes a number of hyperbolic statements, that I don't think are appropriate for a research paper: "immense potential", "considerably superior" ********** Post-Rebuttal Comments ************** The authors' response did clarified a couple of the points I raised / misunderstandings I had. I am not particularly confident that using DIRECT for optimizing line 6 is the best choice, especially for higher-dimensional problems. I don't really feel swayed either way by the response.

Review 2

Summary and Contributions: This paper presents a Bayesian Optimization approach extending Thompson sampling to cases where several proxies of the objective function co-exist, and lead to as many Gaussian Process models relying on observations from the different proxies (one of them being the objective function itself). The main idea is to leverage this multiplicity while keeping respective observations private to the individual models, namely by exchanging information about GP parameters. Things are made particularly simple here by adopting shared random features, so that one ends up with Bayesian (Gaussian) linear models, and the information exchange amounts to communicating posterior distributions of model parameters, via posterior means and covariances in the considered Gaussian case. Then the modification of Thompson sampling is plainly done by sampling from the distribution of optimizers of one of the considered posterior GP models, either from the one associated with the objective function, or with some probability from one of the GP models associated with the proxies. Here the probability of choosing the first case increases to 1 at a prescribed pace, and the conditional probabilities of the choosing each of remaining proxy GPs otherwise are prescribed, typically uniform among the remaining models (or « agents ») at hand. A theoretical result is established in terms of cumulative regret, that guarantees sublinear growth (and quantifies it in more detail) with arbitrary high probability given an adequate scheme for the probability of choosing the agent associated with the actual objective function. Furthermore, it is shown on GP realizations and on several real-world experiments that i) the proposed « federated Thompson Sampling » may achieve fast decrease of the simple regret than standard Thompson Sampling and ii) can provide substantial wall-clock time savings.

Strengths: The idea of federating various agents in order to help and speed-up Bayesian Optimization is great and of huge potential usefulness in machine learning and in a variety of application fields. The approach introduced here is building up on Thompson sampling, one of the canonical Bayesian Optimization algorithms, and comes at the same time with theoretical guarantees (NB : I did not have a chance to thoroughly check the proof) and promising speed-ups observed in numerical experiments.

Weaknesses: It feels like the proposed approach could be cast as some specific instance of Thompson sampling scheme under a model mixture, and that by highlighting that one should probably in turn explore further the literature on BO under model mixtures. Also, it appears that the mixing scheme used here is quite simplistic and not accounting for the respective levels of fidelity of the different agents, offering considerable room for improvement (although incorporating this in the theoretical result might be quite involved). Besides this, while multi-fidelity / multi-source / multi-model BO has grown in the last years, there is not much account of that in the paper and it might well be relevant to better situate the contribution within these frameworks. Last but not least, I liked very much the idea of appealing to a common realization of random features, but I am also wondering to what extent this could be applied to more generic hyperparameters (e.g. kernel hyperparameters beyond coefficients in the linear case).

Correctness: I was a bit puzzled in lines 106-108 by the random kernel approximation followed by a so-called theoretical guarantee where I don’t really see how the randomness is accounted for. Also, in lines 131-132, the authors claim that “This further suggests that the absolute function values are upper bounded”; yet for this one requires a bounded kernel. Have according assumptions been made (e.g. compact domain and continuous kernel)? As for the correctness of the theoretical proposition, I did not have a chance to thoroughly check the proof. Further minor comments/questions follow: 1. “M \geq 1-dimensional random features“ in line 105 is kind of confusing 2. In line 114, in “which contain M2 and M parameters respectively”, employing “parameters” may be misleading 3. In line 299, “We treat one of the participants”, is the one fixed or random? More detail would help. In line 313, “Fig. 2 shows the (averaged) best performance”, could you explain in more detail why the focus is solely on the “best”? Is it simply the sense of taking the best observed point/response of a sequence as outcome?

Clarity: While I found the paper very well written overall, I have the feeling that I struggled a bit more than I should have trying to understand how the method works. Perhaps there is a way to formulate things in a simple manner, revolving around the idea that x^* is drawn at random from a mixture.

Relation to Prior Work: In relation to the previous points, I expect that once the proposed approach is seen as Thompson sampling under a mixture of models, a door is opened to the literature on related topics. I would be keen on hearing from the authors how they position their work with respect to this literature, that of course requires some additional bibliographical work. In a different flavour, while it is clear that the present approach offers so nice specificities related to information privacy, it still appears related to a stream of works around multi-fidelity BO, multi-source BO, multi-model BO, etc. and I think that an additional effort to situate the paper within this stream would potentially be worthwhile.

Reproducibility: Yes

Additional Feedback: Regarding privacy, it might be relevant to discuss to what extent knowing parameters of one or ther other agent may inform or not about them (i.e., even if the input/output tuples are unknown, one still gets some info). Added post-rebuttal: To me the rebuttal is fair.

Review 3

Summary and Contributions: This paper proposed a framework of federated Bayesian optimization for distributed collaborative zero-order optimization via Thompson sampling with approximation. Random Fourier features (RFF) approximation is adopted for GP to reduce the complexity. Theoretical analysis shows the convergence of FBO. Experiments with synthetic functions and real datasets validate the claims of faster convergence.

Strengths: * Novel framework of federated Bayesian optimization. * Theoretical analysis proves the convergence * Experiments validate the claims.

Weaknesses: * It seems quite computation- and communication- intensive and may limit its applicability in the real-world setting. * The idea is quite a natural extension into federated learning setting.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback:

Review 4

Summary and Contributions: The authors present an approach to perform Bayesian Optimization in a federated manner. The method applies in particular in settings, where no gradient information can be shared between agents, such as gradient-free hyper-parameter optimization for deep networks. Instead of passing direct information about a surrogate model (modeled as a Gaussian Process) between agents, the authors propose to instead pass random Fourier features of said GPs. To enable efficiency of communication between agents, federated Thompson sampling and random Fourier features are proposed as an effective solution. Finally, the authors provide a regret bound analysis applicable for agents with significantly different objective functions.

Strengths: The author's contribution is timely, relevant and within an area that is currently receiving significant attention. The main regret analysis theorem will be of interest to the community. The work is evaluated both on real and synthetic datasets.

Weaknesses: Approach-wise, the work is a combination and modification of three established techniques, i.e. mainly Bayesian Optimization with GPs, Random Fourier Features and a distributed Thompson Sampling approach. Experiments: The authors validate the convergence and data efficiency first on a simple synthetic GP model. Since GP generated data is likely easiest for a method based on GPs, another experiment with increasingly challenging synthetic optimization functions would have been of interest. Previous concerns highlighted in the first version of this review have been addressed by the authors during the rebuttal phase and this review has been updated accordingly.

Correctness: The overall methodology appears correct, but the reviewer did not have time to verify all the details of the proof of the main theorem in the appendix.

Clarity: The paper is well written and the method is clearly described.

Relation to Prior Work: The authors have addressed feedback comments on additional related work in their rebuttal.

Reproducibility: Yes

Additional Feedback: The reviewer thanks the authors for their modifications to the paper following the initial review.