NeurIPS 2020

### Review 1

Summary and Contributions: This paper applies GPs and Bayesian optimization to online model selection, for example deploying new ad models and choosing the one that maximizes clicks. The approaches used, sparse GPs and variational inference, are not new. The novelty is in the model and the application. Experimental results show clear improvement in performance. The paper is very well written, and was a pleasure to read.

Strengths: This paper proposes a theoretically well-grounded model for accumulative metrics, and uses sparse GPs for scalability when optimizing the metrics. The case of non-linear binary feedback from GPs is also considered (although details on inference for this case are missing). The proposed model seems adequate for modeling practical online optimization problems. The experiments seem to be complete. Results show clear improvement in performance compared to the baselines.

Weaknesses: The main weakness is that the methods applied to optimize the proposed model, VI and sparse GPs, are not new, making the paper relatively straightforward.

Correctness: The technical details and the empirical evaluations seem correct, and extensive.

Clarity: Here are some suggestions to improve the clarity: Section 3.2: Explicitly mention that the epistemic or model uncertainty is being modeled here. More details about sparse GPs and VI can be provided in the appendix. Section 3.3: It can help to discuss the inference procedure (perhaps in the appendix) for the binary feedback case. Experiments: More experimental details should be provided. For instance, how is the probability matrix computed?

Reproducibility: Yes

Additional Feedback: Overall this is a nice paper; however, more details should be added to the paper for better readability. Questions: How is the probability matrix P computed? Stylistic comment: Lines 284, 289, 315, 318: The values should accompany the plot and should be in the figure captions. Some typos: Line 107: 'radial' Line 191: 'Imagine' Line 266: 'how OPE works' ======= Post rebuttal: My views about this paper are still positive, so I am keeping my score.

### Review 2

Summary and Contributions: This paper studied the problem of model selection via a Gaussian process. The authors proposed an automated online experimentation mechanism that can efficiently perform model selection from a large pool of models with a small number of online experiments.

Strengths: - The paper is well-written, well-structured and easy to follow. - Discussion on the differences among model selection algorithms, deep reinforcement algorithms, and bandit algorithms is provided. - Five baseline models are compared in the experiments to demonstrate the superiority of the proposed model selection algorithm.

Weaknesses: - The motivation for using a Gaussian process, rather than other point process models, deep learning algorithms or traditional machine learning algorithms, as the surrogate model for the distribution of the immediate feedback is not convincing. - Experiments were conducted on two constructed simulators rather than on real human action records, making the experimental results not very convincing. - Source code of the proposed model is not publicly available, making re-implementation of the proposed model challenging. - The proposed model is not time-aware, i.e., it doesn't take time information into account. The most recent feedback may influence model selection more than out-of-date feedback.

Correctness: As far as I can see, there is nothing incorrect with the paper.

Clarity: This paper is generally well-written and structured well.

Relation to Prior Work: There are some previous works that integrate Gaussian processes and learning (e.g., model selection). Some references on the integration of GPs and learning algorithms can be included in the related work section.

Reproducibility: No

Additional Feedback: Code of the proposed model should be publicly available to the others.

### Review 3

Summary and Contributions: This paper studies how to improve the efficiency of model selection for the production system. The authors propose an automated online experimentation mechanism (AOE) for model selection with few online experiments. They construct two synthetic experiments based on real data and demonstrate the effectiveness of the mechanism.

Strengths: 1. The authors study an important problem, and the proposed model can efficiently perform model selection from a large pool of models with a small number of online experiments, which is critical for production systems. 2. The authors propose using a Gaussian process to model the feedback, including a feedback (reward) model and a noise model, which is a reasonable idea. 3. The authors model the uncertainty of the feedback, which is also important for production systems.

Weaknesses: 1. The proposed Bayesian surrogate model contains two parts: 1) a GP model that captures the "noise-free" component of the immediate feedback, and 2) a noise distribution used to absorb all the stochasticity. It is not clear what the noise is or how it affects the feedback. It would help readers better understand the model if the authors could provide some examples. 2. For the uncertainty, the authors should explain what kind of uncertainty could be extracted from the noise. 3. The synthetic experiments look good, but the results are not convincing without real-world experiments in production systems.

Correctness: Yes.

Clarity: The paper writing can be improved.

Relation to Prior Work: Yes.

Reproducibility: Yes

### Review 4

Summary and Contributions: The paper proposes a model selection algorithm called Model Selection with Automated Online Experiments (AOE) that is designed for use in production systems. In the problem statement, it is stated that the goal of the model selection problem is to select the model from a set of candidate models that maximises a metric of interest. It is assumed that the metric of interest can be expressed as the average immediate feedback from each of a model's predictions. AOE uses both historical log data and data collected from a small budget of online experiments to inform the choice of model. A distribution for the accumulative metric, or expected immediate feedback, is derived. It contains the distribution of inputs to the model, the distribution of model predictions conditioned on inputs and the distribution of the immediate feedback conditioned on inputs and predictions. The distribution of the immediate feedback is learned by a Bayesian surrogate model. The surrogate model is first trained on historical log data. Models are then selected sequentially for online experiments using an acquisition function. The data collected from each online experiment is used to update the surrogate model. This method is similar to Bayesian optimisation, but it is subtly different. Whereas Bayesian optimisation would use a surrogate model to model the metric conditioned on the choice of model, the surrogate model of the proposed method models the immediate feedback conditioned on inputs to the model and predictions returned by the model. There are model selection experiments in which AOE outperformed five baseline methods including Bayesian optimisation, which performed poorly in this setting. 
Contributions of the paper include: a derivation of the distribution of the accumulative metric; a method for approximating this distribution with a Bayesian surrogate model; a method for model selection for production systems that can utilise historical log data and data from online experiments.
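The sequential procedure described in this summary can be sketched roughly as follows. This is an illustration only, not the authors' implementation: it assumes discrete inputs and predictions, swaps the paper's sparse GP surrogate for a much simpler Beta-Bernoulli surrogate over binary feedback, and uses an upper-confidence-bound rule as a stand-in acquisition function. All names (`select_model`, `true_theta`, `logging_policy`, and so on) are hypothetical, and `true_theta` plays the role of the production system that would generate real feedback.

```python
import random
from statistics import mean, pstdev


def expected_metric(policy, p_x, theta):
    """Accumulative metric: sum_x p(x) sum_a p(a|x) * E[feedback | x, a]."""
    return sum(p_x[x] * policy[x][a] * theta[x][a]
               for x in range(len(p_x)) for a in range(len(policy[x])))


def posterior_mean(counts):
    """Mean of each Beta(successes, failures) cell of the surrogate."""
    return [[s / (s + f) for s, f in row] for row in counts]


def posterior_sample(counts, rng):
    """One joint draw of the feedback probabilities from the surrogate."""
    return [[rng.betavariate(s, f) for s, f in row] for row in counts]


def run_experiment(policy, p_x, true_theta, n, rng):
    """Simulate n interactions; returns (input, prediction, feedback) triples."""
    data = []
    for _ in range(n):
        x = rng.choices(range(len(p_x)), weights=p_x)[0]
        a = rng.choices(range(len(policy[x])), weights=policy[x])[0]
        data.append((x, a, 1 if rng.random() < true_theta[x][a] else 0))
    return data


def update(counts, data):
    """Fold observed binary feedback into the Beta counts in place."""
    for x, a, m in data:
        s, f = counts[x][a]
        counts[x][a] = (s + m, f + 1 - m)


def select_model(policies, p_x, true_theta, logging_policy,
                 n_log=500, budget=5, n_per_exp=200, n_samples=100, seed=0):
    rng = random.Random(seed)
    # Beta(1, 1) prior on the feedback probability of every (input, prediction).
    counts = [[(1, 1) for _ in row] for row in true_theta]
    # Step 1: fit the surrogate on historical log data.
    update(counts, run_experiment(logging_policy, p_x, true_theta, n_log, rng))
    # Step 2: spend the online budget, one candidate model per experiment.
    for _ in range(budget):
        scores = []
        for policy in policies:
            draws = [expected_metric(policy, p_x, posterior_sample(counts, rng))
                     for _ in range(n_samples)]
            # Assumed acquisition rule: mean plus two standard deviations.
            scores.append(mean(draws) + 2.0 * pstdev(draws))
        chosen = scores.index(max(scores))
        update(counts, run_experiment(policies[chosen], p_x, true_theta,
                                      n_per_exp, rng))
    # Step 3: recommend the model with the best posterior-mean metric.
    final = [expected_metric(p, p_x, posterior_mean(counts)) for p in policies]
    return final.index(max(final)), counts
```

The point of the sketch is the one the summary makes: the surrogate models feedback given inputs and predictions, so data gathered under one candidate model also sharpens the metric estimates of every other candidate that makes overlapping predictions.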

Strengths: The idea of using both historical log data and online experimental data to inform the model selection process is novel. The method for approximating the distribution of the metric based on the surrogate model of the immediate feedback is also novel. There has been interest in model selection and related problems such as hyperparameter optimisation and neural architecture search, so it is likely that this paper would be of interest to the NeurIPS community. The experimental results are strong. In both the classification and recommender system experiments, AOE found models with an accumulative metric score that was closer to the optimal accumulative metric score. In the same experiments, AOE was able to predict the value of the accumulative metric with lower root mean squared error. The AOE method appears to be applicable to a wide range of model selection problems since it can be used with both binary immediate feedback and real-valued immediate feedback. The only restriction is that the metric of interest must be expressible as the average of this immediate feedback. However, the paper explains that this is not very restrictive and cites model selection for recommender systems as a use case where the click-through rate of users is the average of the immediate feedback.
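As a sketch of this restriction, the metric can be written as an expectation over the three distributions listed in the summary. The notation is assumed here for illustration and is not taken from the paper: $x$ is an input, $a$ a model prediction, $m$ the immediate feedback, and $v$ the accumulative metric.

```latex
v \;=\; \mathbb{E}_{p(x)\,p(a \mid x)\,p(m \mid a, x)}\left[\, m \,\right]
  \;=\; \iiint m \; p(m \mid a, x)\, p(a \mid x)\, p(x)\; \mathrm{d}m\, \mathrm{d}a\, \mathrm{d}x
  \;\approx\; \frac{1}{N} \sum_{i=1}^{N} m_i
```

In the click-through-rate use case, each $m_i \in \{0, 1\}$ records whether a recommendation was clicked, so the empirical average on the right is the observed click-through rate. Only $p(m \mid a, x)$ is unknown and has to be learned by the surrogate model; $p(a \mid x)$ is given by the candidate model itself.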

Weaknesses: In the description of the AOE method it is stated that the immediate feedback can be real-valued or binary. However, there were no experiments where the immediate feedback was real-valued. It would have been nice to see an experiment with real-valued immediate feedback to verify that the AOE method still compares as favourably to the baseline methods when the immediate feedback is real-valued. The paper uses a Gaussian process (GP) as the surrogate model. Since the amount of historical log data is potentially large, the GP is a sparse GP where the predicted variance is computed with the FITC approximation. Based on the experimental results, the sparse GP appears to work well. However, a class of probabilistic model that scales to large amounts of data without such sparsification would perhaps be a more suitable choice for the surrogate model. For example, a paper from the NeurIPS 2016 conference called “Bayesian Optimization with Robust Bayesian Neural Networks” demonstrated that Bayesian neural networks can be used as a surrogate model instead of GPs to allow Bayesian optimisation to scale to more function evaluations. Perhaps a Bayesian neural network surrogate model would be appropriate in this setting as well. --- POST REBUTTAL --- I read the author response and the other reviews. The authors have successfully addressed my concerns about a few minor issues. I keep my initial view on the paper that this is an interesting, novel, and solid piece of work, as well as my initial score.

Correctness: After reading through the paper twice, the claims, the methods and the empirical evaluation all appear to be correct.

Clarity: The paper is very well written. The order of all the sections and subsections worked well. Each one followed on nicely from the last. The descriptions of the problem and the method were all clear.

Relation to Prior Work: The relation to prior work is clearly discussed. The paper discusses similarities and differences between AOE and two other approaches to model selection. The first is A/B testing, which like AOE uses online experiments to inform the choice of model, but unlike AOE does not use historical log data. The second approach is off-policy evaluation (OPE). OPE uses only historical log data to estimate the metric of interest for candidate models. The paper highlights issues with these two competing approaches that are addressed by AOE. Finding the best model with A/B testing may require many online experiments. However, in practice, online testing of candidate models can be time-consuming or subject to resource constraints, which means the number of online experiments that can be afforded is usually very small. On the other hand, OPE can predict the metric of interest very well for candidate models that behave similarly to the model used to collect the log data, but may struggle to do so when a candidate model behaves very differently from the log data model. To address these issues, AOE combines the efficiency of OPE, by first learning from historical log data, with the more direct measurement of a candidate model used in A/B testing, by utilising a small number of online experiments. In addition, the paper mentions problems in the literature that share some similarities with the model selection problem. Among these are hyperparameter optimisation and neural architecture search.

Reproducibility: Yes