NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 3687
Title: Hyperparameter Learning via Distributional Transfer

Reviewer 1

[I have read the author rebuttal. Since it does address my essential concern of whether the performance observed in the paper generalizes, I have upgraded my score.]

Summary: I liked the step in the direction of a more principled and generic approach for leveraging dataset characteristics for hyperparameter selection. However, the limitation to a single model class and instance space appears to constrain broad applicability. Moreover, even in this restricted setting, there is a fairly large design space to consider: the embedding distribution (or combination thereof), the neural network architecture(s) for the embeddings and potentially for the BLR data matrix, acquisition function settings, and optimization settings. This makes me wonder how easy the approach would be to apply generally. A broader experimental section would help assuage this concern.

Originality: The incorporation of sampling-distribution embeddings within BO for hyperparameter selection is novel to my knowledge.

Quality: The two toy examples are nice proofs of concept, but a broader set of experimental results on interesting real-world scenarios would go a long way toward providing convincing evidence that the restricted setting and the various design decisions are not issues. The results on the real dataset are positive for the proposed method, but underwhelming in that the simpler approaches of using a kernel over dataset meta-features (manualGP) and warm-starting with dataset meta-features (initGP) also perform fairly well in comparison. How did the authors make their decisions on the neural network architecture, acquisition function, and optimization settings?

Clarity: I do not have major comments here; I found the paper fairly easy to follow. Also, even though I have concerns regarding the design space, I would like to point out that the authors have provided a lot of detail on their experimental setup.
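To make the design space the reviewer describes concrete, here is a minimal sketch of that kind of pipeline: an embedding summarising each training dataset (a fixed random-feature map standing in for a learned NN), concatenated with a hyperparameter value to form the rows of a Bayesian linear regression surrogate. All feature maps, dimensions, and numbers below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_embedding(X, W, b):
    """Mean embedding of a dataset: average a fixed nonlinear feature
    map over its samples (a random-feature stand-in for a learned NN)."""
    return np.tanh(X @ W + b).mean(axis=0)

# Two toy "datasets" (50 points in 2-D each) and one hyperparameter value
# observed on each; all values are made up for illustration.
datasets = [rng.normal(loc=m, size=(50, 2)) for m in (0.0, 2.0)]
hypers = [0.1, 1.0]
W, b = rng.normal(size=(2, 8)), rng.normal(size=8)

# Rows of the BLR data matrix: dataset embedding concatenated with the
# hyperparameter at which the objective was evaluated.
Phi = np.stack([np.concatenate([mean_embedding(D, W, b), [lam]])
                for D, lam in zip(datasets, hypers)])
y = np.array([0.3, 0.7])  # fictitious validation losses

# Bayesian linear regression posterior mean (unit prior and unit noise).
A = Phi.T @ Phi + np.eye(Phi.shape[1])
w_post = np.linalg.solve(A, Phi.T @ y)
```

Each bolded design choice in the review corresponds to a swappable piece here: the feature map, the embedding combination, and the surrogate on top of `Phi`.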

Reviewer 2

The paper proposes to transfer information across tasks using learnt representations of the training datasets used in those tasks. This results in a joint Gaussian process model over hyperparameters and data representations. The developed method converges faster than existing baselines, in some cases requiring only a few evaluations of the target objective. Through experiments, the authors show that it is possible to borrow strength between multiple hyperparameter learning tasks by making use of the similarity between the training datasets used in those tasks. This helps develop the new method, which finds a favourable setting of the hyperparameters in only a few evaluations of the target objective.
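The "joint Gaussian process model over hyperparameters and data representations" can be illustrated with a product kernel over (hyperparameter, dataset-representation) pairs; the specific kernel below is an assumed illustrative choice, not necessarily the one in the paper.

```python
import numpy as np

def rbf(A, ls):
    """Squared-exponential kernel matrix of a set of row vectors."""
    d2 = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def joint_kernel(H, R, ls_h=1.0, ls_r=1.0):
    """Product kernel over (hyperparameter, dataset representation)."""
    return rbf(H, ls_h) * rbf(R, ls_r)

# Three observations: the first two come from one task (identical dataset
# representation), the third from another; all values are made up.
H = np.array([[0.1], [0.5], [0.9]])            # hyperparameter values
R = np.array([[0.0, 1.0], [0.0, 1.0], [2.0, 0.0]])  # dataset embeddings
K = joint_kernel(H, R)
# Observations that share a dataset representation correlate more
# strongly, which is how information is borrowed across tasks.
```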

Reviewer 3

This paper proposes a novel method for transfer learning in Bayesian hyperparameter optimization, based on the premise that the distributions of previously observed datasets contain significant information that should not be ignored during hyperparameter optimization on a new dataset. The authors propose solutions for comparing different datasets through distribution estimation, and then combine this information with the classical Bayesian hyperparameter optimization setup. Experiments show that the method outperforms the selected baselines.

Originality: the method is novel, although it mostly bridges ideas from various fields.

Quality: I would like to congratulate the authors on a very well written paper. The quality is very high throughout.

Clarity: The paper is clear in general, but left me with some questions. For instance, what criterion is used to learn the NN embeddings of the data distributions (Section 4.1)? Is it the marginal likelihood of the distGP/distBLR? Is it learned at the same time as the other parameters mu, nu, and sigma? I am not deeply familiar with the Deep Kernel Learning paper, but it seems the authors there do use the marginal likelihood.

Significance: I think the paper has the potential to have a high impact in the field. I would have liked more extensive experiments to showcase this, e.g., evaluation on extensive benchmarks such as the ones in [Perrone et al. 2018]. Is there any specific reason why the ADAM optimizer was chosen for the hyperparameters? (I ask because most of the literature seems to have settled on L-BFGS to get rid of the learning rate.) I will finally mention in passing that the appendix is referenced a total of 14 times in the paper; while this has no bearing on my final score, it does feel a bit heavy to have to go back and forth between the paper and the appendix.

POST REBUTTAL: I have read the rebuttal, and thank the authors for the clarifications -- it seems I had misunderstood some parts of the paper. I still think this is a decent addition to the literature on hyperparameter optimization, and I have seen instances of problems where this solution would be applicable.
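For context on the marginal-likelihood question raised above: learning embedding or kernel parameters "by marginal likelihood" means minimising the GP's negative log marginal likelihood with respect to those parameters, which is the objective either ADAM or L-BFGS would target. A toy one-dimensional version, with a single lengthscale and a coarse grid search standing in for either optimiser (all data synthetic):

```python
import numpy as np

def gp_nll(X, y, ls, noise=0.1):
    """Negative log marginal likelihood of a zero-mean GP with an RBF
    kernel; the loss an optimiser would minimise w.r.t. `ls`."""
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ls ** 2)
    K += noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.log(np.diag(L)).sum()
            + 0.5 * len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 20)
y = np.sin(4.0 * X) + 0.1 * rng.normal(size=20)

# Grid search over the lengthscale stands in for ADAM / L-BFGS here.
grid = np.linspace(0.05, 1.0, 40)
best_ls = grid[np.argmin([gp_nll(X, y, ls) for ls in grid])]
```

In this view the choice of ADAM versus L-BFGS only changes how the same loss surface is searched, not the criterion itself.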