{"title": "Learning search spaces for Bayesian optimization: Another view of hyperparameter transfer learning", "book": "Advances in Neural Information Processing Systems", "page_first": 12771, "page_last": 12781, "abstract": "Bayesian optimization (BO) is a successful methodology to optimize black-box functions that are expensive to evaluate. While traditional methods optimize each black-box function in isolation, there has been recent interest in speeding up BO by transferring knowledge across multiple related black-box functions. In this work, we introduce a method to automatically design the BO search space by relying on evaluations of previous black-box functions. We depart from the common practice of defining a set of arbitrary search ranges a priori by considering search space geometries that are learnt from historical data. This simple, yet effective strategy can be used to endow many existing BO methods with transfer learning properties. Despite its simplicity, we show that our approach considerably boosts BO by reducing the size of the search space, thus accelerating the optimization of a variety of black-box optimization problems. In particular, the proposed approach combined with random search results in a parameter-free, easy-to-implement, robust hyperparameter optimization strategy. We hope it will constitute a natural baseline for further research attempting to warm-start BO.", "full_text": "Learning search spaces for Bayesian optimization:\nAnother view of hyperparameter transfer learning\n\nValerio Perrone, Huibin Shen, Matthias Seeger, C\u00e9dric Archambeau, Rodolphe Jenatton\u2217\n\n{vperrone, huibishe, matthis, cedrica}@amazon.com\n\nAmazon\n\nBerlin, Germany\n\nAbstract\n\nBayesian optimization (BO) is a successful methodology to optimize black-box\nfunctions that are expensive to evaluate. 
While traditional methods optimize each black-box function in isolation, there has been recent interest in speeding up BO by transferring knowledge across multiple related black-box functions. In this work, we introduce a method to automatically design the BO search space by relying on evaluations of previous black-box functions. We depart from the common practice of defining a set of arbitrary search ranges a priori by considering search space geometries that are learned from historical data. This simple, yet effective strategy can be used to endow many existing BO methods with transfer learning properties. Despite its simplicity, we show that our approach considerably boosts BO by reducing the size of the search space, thus accelerating the optimization of a variety of black-box optimization problems. In particular, the proposed approach combined with random search results in a parameter-free, easy-to-implement, robust hyperparameter optimization strategy. We hope it will constitute a natural baseline for further research attempting to warm-start BO.

1 Introduction

Tuning the hyperparameters (HPs) of machine learning (ML) models, and in particular deep neural networks, is critical to achieve good predictive performance. Unfortunately, the mapping of the HPs to the prediction error is in general a black-box in the sense that neither its analytical form nor its gradients are available. Moreover, every (noisy) evaluation of this black-box is time-consuming as it requires retraining the model from scratch. Bayesian optimization (BO) provides a principled approach to this problem: an acquisition function, which takes as input a cheap probabilistic surrogate model of the target black-box function, repeatedly scores promising HP configurations by performing an explore-exploit trade-off [30, 22, 37]. The surrogate model is built from the set of black-box function evaluations observed so far. 
For example, a popular approach is to impose a Gaussian\nprocess (GP) prior on the unobserved target black-box function f (x). Based on a set of evaluations\n{f (xi)}n\ni=1, possibly perturbed by Gaussian noise, one can compute the posterior GP, characterized\nby a posterior mean and a posterior (co)variance function. The next query points are selected by\noptimizing an acquisition function, such as the expected improvement [30], which is analytically\ntractable given these two quantities. While BO takes the human out of the loop in ML by automating\nHP optimization (HPO), it still requires the user to de\ufb01ne a suitable search space a priori. De\ufb01ning a\ndefault search space for a particular ML problem is dif\ufb01cult and left to human experts.\nIn this work, we automatically design the BO search space, which is a critical input to any BO\nprocedure applied to HPO, based on historical data. The proposed approach relies on the observation\nthat HPO problems occurring in ML are often related (for example, tuning the HPs of an ML model\n\n\u2217Work done while af\ufb01liated with Amazon; now at Google Brain, Berlin, rjenatton@google.com\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\ftrained on different data sets [43, 3, 51, 35, 11, 32, 9, 28]). Moreover, our method learns a suitable\nsearch space in a universal fashion: it can endow any BO algorithm with transfer learning capabilities.\nFor instance, we demonstrate this feature with three widely used HPO algorithms \u2013 random search [4],\nSMAC [20] and hyperband [29]. Further, we investigate the use of novel geometrical representations\nof the search spaces, departing from the traditional rectangular boxes. 
In particular, we show that an\nellipsoidal representation is not only simple to compute and manipulate, but leads to faster black-box\noptimization, especially as the dimension of the search space increases.\n\n2 Related work and contributions\n\nPrevious work has implemented BO transfer learning in many different ways. For instance, the\nproblem can be framed as a multi-task learning problem, where each run of BO corresponds to a task.\nTasks can be modelled jointly or as being conditionally independent with a multi-output GP [43], a\nBayesian neural network [41], a multi-layer perceptron with Bayesian linear regression heads [39, 32],\npossibly together with some embedding [28], or a weighted combination of GPs [35, 9]. Alternatively,\nseveral authors attempted to rely on manually de\ufb01ned meta-features in order to measure the similarity\nbetween BO problems [3, 51, 35]. If these problems further come in a speci\ufb01c ordering (e.g., because\nof successive releases of an ML model), the successive surrogate models can be \ufb01t to the residuals\nrelative to predictions of the previous learned surrogate model [15, 33]. In particular, if GP surrogates\nare used, the new GP is centered around the predictive mean of the previously learned GP surrogate.\nFinally, rather than \ufb01tting a surrogate model to all past data, some transfer can be achieved by\nwarm-starting BO with the solutions to the previous BO problems [10, 50].\nThe work most closely related to ours is [49], where the search space is pruned during BO, removing\nunpromising regions based on information from related BO problems. Similarity scores between BO\nproblems are computed from data set meta-features. While we also aim to restrict the BO search space,\nour approach is different in many ways. First, we do not require meta-features, which in practice can\nbe hard to obtain and need careful manual design. 
Second, our procedure works completely of\ufb02ine,\nas a preprocessing step, and does not require feedback from the black-box function being optimized.\nThird, it is parameter-free and model-free. By contrast, [49] rely on a GP model and have to select a\nradius and the fraction of the space to prune. Finally, [49] use a discretization step to prune the search\nspace, which may not scale well as its dimension increases. The generality of our approach is such\nthat [49] could be used on top of our proposed method (while the converse is not true).\nAnother line of research has developed search space expansion strategies for BO. Those approaches\nare less dependent on the initial search space provided by the users, incrementally expanding it\nduring BO [36, 31]. None of this research has considered transfer learning. A related idea to learn\nhyperparameter importance has been explored in [45], where a post-hoc functional ANOVA analysis\nis used to learn priors over the hyperparameter space. Again, such techniques could be used together\nwith our approach, which would de\ufb01ne the initial search space in a data driven manner.\nOur contributions are: (1) We introduce a simple and generic class of methods that design compact\nsearch spaces from historical data, making it possible to endow any BO method with transfer learning\nproperties, (2) we explore and demonstrate the value of new geometrical representations of search\nspaces beyond the rectangular boxes traditionally employed, and (3) we show over a broad set of\ntransfer learning experiments that our approach consistently boosts the performance of the optimizers\nit is paired with. 
When combined with random search, the resulting simple and parameter-free optimization strategy constitutes a strong baseline, which we hope will be adopted in future research.

3 Black-box function optimization on a reduced search space

Consider T + 1 black-box functions {f_t(·)}_{t=0}^{T}, defined on a common search space X ⊆ R^p. The functions are expensive to evaluate, possibly non-convex, and accessible only through their values, without gradient information. In loose terms, the functions {f_t(·)}_{t=0}^{T} are assumed related, corresponding for instance to the evaluations of a given ML model over T + 1 data sets (in which case X is the set of feasible HPs of this model). Our goal is to minimize f_0:

    min_{x ∈ X} f_0(x).    (1)

However, for t ≥ 1, we have access to n_t noisy evaluations of the function f_t, which we denote by D_t = {(x_{i,t}, y_{i,t})}_{i=1}^{n_t}. In this work, we consider methods that take the previous evaluations {D_t}_{t=1}^{T} as inputs, and output a search space X̂ ⊆ X, so that we solve the following problem instead of (1):

    min_{x ∈ X̂} f_0(x).    (2)

The local minima of (2) are a subset of the local minima of (1). Since X̂ is more compact (formally defined later), BO methods will find those minima faster, i.e., with fewer function evaluations. Hence, we aim to design X̂ such that it contains a "good" set of local minima, close to the global ones of X.

4 Data-driven search space design via transfer learning

Notations and preliminaries. We define (x*_t, y*_t) as the element in D_t that reaches the smallest (i.e., best) evaluation for the black-box t, i.e., (x*_t, y*_t) = argmin_{(x_t, y_t) ∈ D_t} y_t. For any two vectors u and w in R^p, u ≤ w stands for the element-wise inequalities u_j ≤ w_j for j ∈ {1, . . . , p}. 
We also denote by |u| the vector with entries |u_j| for j ∈ {1, . . . , p}. For a collection {w_t}_{t=1}^{T} of T vectors in R^p, we denote by min{w_t}_{t=1}^{T}, respectively max{w_t}_{t=1}^{T}, the p-dimensional vector resulting from the element-wise minimum, respectively maximum, over the T vectors. Finally, for any symmetric matrix A ∈ R^{p×p}, A ≻ 0 indicates that A is positive definite.

We assume that the original search space X is defined by axis-aligned ranges that can be thought of as a bounding box: X = {x ∈ R^p | l_0 ≤ x ≤ u_0}, where l_0 and u_0 are the initial vectors of lower and upper bounds. Search spaces represented as boxes are commonly used in popular BO packages such as Spearmint [38], GPyOpt [1], GPflowOpt [27], Dragonfly [24] and Ax [7].

The methodology we develop applies to numerical parameters (either integer or continuous). If the problems under consideration exhibit categorical parameters (we have such examples in our experiments, Section 6), we let X = X_cat × X_num. Our methodology then applies to X_num only, keeping X_cat unchanged. Hence, in the remainder the dimension p refers to the dimension of X_num.

4.1 Search space estimation as an optimization problem

The reduced search space X̂ we would like to learn is defined by a parameter vector θ ∈ R^k. To estimate X̂, we consider the following constrained optimization problem:

    min_{θ ∈ R^k} Q(θ)  such that  for t ≥ 1, x*_t ∈ X̂(θ),    (3)

where Q(θ) is some measure of volume of X̂(θ); concrete examples are given in Sections 4.2 and 4.3. In solving (3), we find a search space X̂(θ) that contains all solutions {x*_t}_{t=1}^{T} to previously solved black-box optimization problems, while at the same time minimizing Q(θ). 
Note that X̂(θ) can only get larger (as measured by Q(θ)) as more related black-box optimization problems are considered. Moreover, formulation (3) does not explicitly use the y*_t's and never compares them across tasks. As a result, unlike previous work such as [51, 9], we need not normalize the tasks (e.g., whitening).

4.2 Search space as a low-volume bounding box

The first instantiation of (3) is a search space defined by a bounding box (or hyperrectangle), which is parameterized by the lower and upper bounds l and u. More formally, X̂(θ) = {x ∈ R^p | l ≤ x ≤ u} and θ = (l, u), with k = 2p. A tight bounding box containing all {x*_t}_{t=1}^{T} can be obtained as the solution to the following constrained minimization problem:

    min_{l ∈ R^p, u ∈ R^p} (1/2) ‖u − l‖_2^2  such that  for t ≥ 1, l ≤ x*_t ≤ u,    (4)

where the compactness of the search space is enforced by a squared ℓ2 term that penalizes large ranges in each dimension. This problem has a simple closed-form solution θ*_b = (l*, u*), where

    l* = min{x*_t}_{t=1}^{T}  and  u* = max{x*_t}_{t=1}^{T}.    (5)

These solutions are simple and intuitive: while the initial lower and upper bounds l_0 and u_0 may define overly wide ranges, the new ranges of X̂(θ*_b) are the smallest ranges containing all the related solutions {x*_t}_{t=1}^{T}. The resulting search space X̂(θ*_b) defines a new, tight bounding box that can directly be used with any optimizer operating on the original X. Despite the simplicity of the definition of X̂(θ*_b), we show in Section 6 that this approach constitutes a surprisingly strong baseline, even when combined with random search only. 
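In code, Eq. (5) is essentially a one-liner. The following NumPy sketch (ours, not the authors' implementation; function names are illustrative) learns the box from the best configurations of the related tasks and runs random search inside it:

```python
import numpy as np

def fit_bounding_box(best_configs):
    # best_configs: array of shape (T, p); row t is x*_t, the best
    # configuration observed on related task t.
    # Eq. (5): the tightest box containing all x*_t is simply the
    # element-wise min and max over tasks.
    X = np.asarray(best_configs, dtype=float)
    return X.min(axis=0), X.max(axis=0)  # l*, u*

def random_search_in_box(lower, upper, n_candidates, seed=0):
    # Plain random search restricted to the learned box [l*, u*].
    rng = np.random.default_rng(seed)
    return rng.uniform(lower, upper, size=(n_candidates, len(lower)))
```

Because the output is just a pair of range vectors, any optimizer that accepts axis-aligned ranges (random search, SMAC, Hyperband, GP-based BO) can consume l* and u* directly, which is what makes the approach modular.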
We will generalize the optimization problem (4) in Section 5 to obtain solutions that account for outliers contained in {x*_t}_{t=1}^{T} and, as a result, produce an even tighter search space.

4.3 Search space as a low-volume ellipsoid

The second instantiation of (3) is a search space defined by a hyperellipsoid (i.e., an affine transformation of a unit ℓ2 ball), which is parameterized by a symmetric positive definite matrix A ∈ R^{p×p} and an offset vector b ∈ R^p. More formally, X̂(θ) = {x ∈ R^p | ‖Ax + b‖_2 ≤ 1} and θ = (A, b), with k = p(p + 3)/2. Using the classical Löwner-John formulation [21], the lowest-volume ellipsoid covering all points {x*_t}_{t=1}^{T} is the solution to the following problem (see Section 8.4 in [5]):

    min_{A ∈ R^{p×p}, A ≻ 0, b ∈ R^p} log det(A^{-1})  such that  for t ≥ 1, ‖Ax*_t + b‖_2 ≤ 1,    (6)

where the T norm constraints enforce x*_t ∈ X̂(θ), while the minimized objective is a strictly increasing function of the volume of the ellipsoid, which is proportional to 1/sqrt(det(A)) [17]. This problem is convex, admits a unique solution θ*_e = (A*, b*), and can be solved efficiently by interior-point algorithms [42]. In our experiments, we use CVXPY [6].

Intuitively, an ellipsoid should be more suitable than a hyperrectangle when the points {x*_t}_{t=1}^{T} we want to cover do not cluster in the corners of the box. In Section 6, a variety of real-world ML problems suggest that the distribution of the solutions {x*_t}_{t=1}^{T} supports this hypothesis. We will also generalize the optimization problem (6) in Section 5 to obtain solutions that account for outliers contained in {x*_t}_{t=1}^{T}.

4.4 Optimizing over ellipsoidal search spaces

We cannot directly plug ellipsoidal search spaces into standard HPO procedures. 
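For intuition, the rejection sampler described next as Algorithm 1 can be written in a few lines of NumPy; this is our own sketch (function and argument names are not from the paper):

```python
import numpy as np

def ellipsoid_random_sample(A, b, lower, upper, rng):
    # Draw uniformly from the intersection of the learned ellipsoid
    # {x : ||A x + b||_2 <= 1} and the original box X = [lower, upper].
    p = b.shape[0]
    A_inv = np.linalg.inv(A)
    while True:
        z = rng.standard_normal(p)
        r = rng.uniform()
        t = (r ** (1.0 / p)) * z / np.linalg.norm(z)  # uniform in the unit l2 ball
        x = A_inv @ (t - b)                           # uniform in the ellipsoid
        if np.all(lower <= x) and np.all(x <= upper):
            return x  # keep only draws that are also feasible in X
```

Since the affine map x = A^{-1}(t − b) preserves uniformity, accepted draws are uniform on the intersection of X and the ellipsoid; the acceptance rate only degrades if a large fraction of the ellipsoid's volume falls outside X.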
Algorithm 1 details how to adapt random search, and as a consequence also methods like Hyperband [29], to an ellipsoidal search space. In a nutshell, we use rejection sampling to guarantee uniform sampling in X ∩ X̂(θ*_e): we first sample uniformly in the p-dimensional ball, then apply the inverse mapping of the ellipsoid [12], and finally check whether the sample belongs to X. The last step is important as not all points in the ellipsoid may be valid points in X. For example, an HP might be restricted to only take positive values. However, after fitting the ellipsoid, some small amount of its volume might include negative values. Finally, ellipsoidal search spaces cannot directly be used with more complex, model-based BO engines, such as GPs. They would require resorting to constrained BO modelling [16, 13, 14], e.g., to optimize the acquisition function over the ellipsoid, which would add significant complexity to the procedure. Hence, we defer this investigation to future work.

Algorithm 1 Rejection sampling algorithm to uniformly sample in an ellipsoidal search space
1: procedure EllipsoidRandomSampling(θ*_e, X)    ▷ θ*_e is the solution from Section 4.3
2:   A*, b* ← θ*_e and IS_FEASIBLE ← FALSE
3:   while not IS_FEASIBLE do
4:     z ∼ N(0, I), with z ∈ R^p, and r ∼ U(0, 1)
5:     t ← r^{1/p} z / ‖z‖_2    ▷ t is uniformly distributed in the unit ℓ2 ball [19]
6:     x ← (A*)^{-1}(t − b*)    ▷ x is uniformly distributed in the ellipsoid X̂(θ*_e) [12]
7:     if x ∈ X then    ▷ we check if x is valid since we may not have X̂(θ*_e) ⊆ X
8:       IS_FEASIBLE ← TRUE
9:   return x

5 Handling outliers in the historical data

The search space chosen by our method is the smallest hyperrectangle or hyperellipsoid 
enclosing a set of solutions {x*_t}_{t=1}^{T} found by optimizing related black-box optimization problems. In order to exploit as much information as possible, a large number of related problems may be considered. However, the learned search space volume might increase as a result, which will make black-box optimization algorithms, such as BO, less effective. For example, if some of these problems depart significantly from the other black-box optimization problems, their contribution to the volume increase might be disproportionate and discarding them will be beneficial. In this section, we extend our methodology to exclude such outliers automatically.

We allow for some x*_t to violate feasibility, but penalize such violations by way of slack variables. To exclude outliers from the hyperrectangle, problem (4) is modified as follows:

    min_{l ∈ R^p, u ∈ R^p, ξ−_t ≥ 0, ξ+_t ≥ 0} (λ_b/2) ‖u − l‖_2^2 + (1/2T) Σ_{t=1}^{T} (ξ−_t + ξ+_t)  such that  for t ≥ 1, l − ξ−_t |l_0| ≤ x*_t ≤ u + ξ+_t |u_0|,    (7)

where λ_b ≥ 0 is a regularization parameter, and {ξ−_t}_{t=1}^{T} and {ξ+_t}_{t=1}^{T} the slack variables associated respectively to l and u, which we make scale-free by using |l_0| and |u_0|. Slack variables can also be used to exclude outliers from an ellipsoidal search region [26, 42] by rewriting (6) as follows:

    min_{A ∈ R^{p×p}, A ≻ 0, b ∈ R^p, ξ_t ≥ 0} λ_e log det(A^{-1}) + (1/T) Σ_{t=1}^{T} ξ_t  such that  for t ≥ 1, ‖Ax*_t + b‖_2 ≤ 1 + ξ_t,    (8)

where λ_e ≥ 0 is a regularization parameter and {ξ_t}_{t=1}^{T} the slack variables. 
Note that the original formulations (4) and (6) are recovered when λ_b or λ_e tend to zero, as the optimal solution is then found when all the slack variables are equal to zero. By contrast, when λ_b or λ_e get larger, more solutions in the set {x*_t}_{t=1}^{T} are ignored, leading to a tighter search space.

To set λ_b and λ_e, we proceed in two steps. First, we compute the optimal solution Q(θ*) of the original problem, namely (4) and (6) for the bounding box and ellipsoid, respectively. Q(θ*) captures the scale of the problem at hand. Then, we look at λ = s/Q(θ*) for s in a small, scale-free grid of values and select the smallest value of λ that leads to no more than (1 − ν) × T solutions from {x*_t}_{t=1}^{T} (by checking the number of active, i.e., strictly positive, slack variables in (7) and (8)). We therefore turn the selection of the abstract regularization parameter λ into the more interpretable choice of ν as a fraction of outliers. In our experiments, we determined those values purely on the toy SGD synthetic setting (Section 6.1), and then applied them as defaults to the real-world problems with no extra tuning (this led to ν_b = 0.5 and ν_e = 0.1).

6 Experiments

Our experiments are guided by three key messages. First, our method can be combined with a large number of HPO algorithms. Hence, we obtain a modular design for HPO (and BO) which is convenient when building ML systems. Second, by introducing parametric assumptions (with the box and the ellipsoid), we show empirically that our approach is more robust to low-data regimes compared to model-based approaches. Third, our simple method induces transfer learning by reducing the search space of BO. 
The method compares favorably to more complex alternative models for transfer learning proposed in the literature, thus setting a competitive baseline for future research.

In the experiments, we consider combinations of search space definitions and HPO algorithms. Box and Ellipsoid refer to learned hyperrectangular (Section 4.2) and hyperellipsoidal search spaces (Section 4.3). When none of these prefixes are used, we defined the search space using manually specified bounding boxes (see Subsections 6.1-6.3 for the specifics). We further consider a diverse set of HPO algorithms: random search, Hyperband [29], GP-based BO, GP-based BO with input warping [40], random forest-based BO [20], and adaptive Bayesian linear regression-based BO [32], which are denoted respectively by Random, HB, GP, GP warping, SMAC, and ABLR. We assessed the transfer learning capabilities in a leave-one-task-out fashion, meaning that we leave out one of the black-box optimization problems and then aggregate the results. Each curve and shaded area in the plots respectively corresponds to the mean metric and standard error obtained over 10 independent replications times the number of leave-one-task-out runs. To report the results and ease comparisons across tasks, we normalize the performance curves of each model by the best value obtained by random search, inspired by [15, 44]. Consequently, each random search performance curve ends up at 1 (or 0 when the metric is log transformed).

Figure 1: Tuning SGD for ridge regression. (a) Comparison of BO algorithms with Box transfer learning counterparts. (b) Comparison of resource-aware BO with transfer learning counterparts Box and Ellipsoid. Note that HB with transfer outperforms all methods shown in (a).

Figure 2: Visualization of the learned Ellipsoid search space (a) without and (b) with slack variables. 
The blue dots are the observed evaluations and the orange dots are the samples drawn from the learned Ellipsoid. The slack extension successfully excludes the outlier learning rate.

6.1 Tuning SGD for ridge regression

We consider the problem of tuning the parameters of stochastic gradient descent (SGD) when optimizing a set of 30 synthetic ridge regression problems with 81 input dimensions and 81 observations. The setting is inspired by [44] and is described in the Supplement. The HPO problem consists in tuning 3 HPs: the learning rate in the range (0.001, 1.0), the momentum in the range (0.3, 0.999), and the regularization parameter in the range (0.001, 10.0). Figure 1a shows that the convergence to a good local minimum of conventional BO algorithms, such as Random, GP, and SMAC, is significantly boosted when the search space is learned (Box) from related problems. It is also interesting to note that all perform similarly once the search space is learned. The results for GP warping combined with Box are similar: Box improves over GP warping, though by a smaller margin than in the GP case, since warping alone already yields a significant performance gain. We show the results in Supplement A.

The transfer learning methodology can be combined with resource-based BO algorithms, such as Hyperband [29]. We defined a unit of resource as three SGD updates (following [44]). By design, the more resources, the better the performance. Figure 1b shows that both the Box and Ellipsoid-based transfer benefit HB. Furthermore, HB with transfer is competitive with all other conventional BO algorithms, including model-based ones, when comparing the RMSE across Figures 1a and 1b.

We then studied the impact of introducing the slack variables to exclude outliers. One example of a learned Ellipsoid search space found for the 3 HPs of SGD on one of the ridge regression tasks is illustrated in Figure 2, together with its slack counterpart. 
The slack extension provides a more compact search space by discarding the outlier with learning rate value ≈ 0.06. A similar result was obtained with the Box-based learned search space (see Supplement A).

6.2 Tuning binary classifiers over multiple OpenML data sets

We consider HPO for three popular binary classification algorithms: random forest (RF; 5 HPs), support vector machine (SVM; 4 HPs), and extreme gradient boosting (XGBoost; 10 HPs). Here, each problem consists of tuning one of these algorithms on a different data set. We leveraged OpenML [46], which provides evaluation data for many ML algorithm and data set pairs. Following [32], we selected the 30 most-evaluated data sets for each algorithm. The default search ranges were defined by the minimum and maximum hyperparameter value used by each algorithm when trained on any of the data sets.

Figure 3: OpenML. (a) Performance of BO algorithms and their transfer learning counterparts. (b) Compares GP warm-start with Box + Random and Ellipsoid + Random. (c) Shows that Box + GP warm-start outperforms plain GP warm-start.

Results are shown in Figure 3a. The Box variants consistently outperform their non-transfer counterparts in terms of convergence speed, and Box Random performs on par with Box SMAC and Box GP. In this experiment, we also compared to GP with input warping [40], which exhibited a comparable boost when used in conjunction with the Box search space, slightly outperforming all other methods. Finally, Ellipsoid Random slightly outperforms Box Random. 
Next, we compare Ellipsoid Random and Box Random with two different transfer learning extensions of GP-based HPO, derived from [10]. Each black-box optimization problem is described by meta-features computed on the data set.2 In GP warm start, the closest problem (among the 29 left out) in terms of ℓ1 distance of meta-features is selected, and its k best evaluations are used to warm-start the GP surrogate. Results are given in Figure 3b. Our simple search space transfer learning techniques outperform these GP-based extensions for all k. We also considered GP warm start T=29, which transfers the k best evaluations from all the 29 left-out problems, appending the meta-feature vector to the hyperparameter configuration as input to the GP (see Figure B1a in Supplement B). Results were qualitatively very similar, but the cubic scaling of GP surrogates renders GP warm start T=29 unattractive for a large T and/or k. In contrast, our transfer learning techniques are model-free.

In the next experiment, we combine our Box search space with these GP-based transfer techniques. In all methods, we use 256 random samples from each problem t: n_t go to Box, the remaining 256 − n_t to GP warm start. The results are given in Figure 3c. It is clear that the Box improves plain GP warm start regardless of n_t. We also ran experiments with GP warm start T=29 (n=*) and ABLR (n=*) [32], a transfer HPO method which scales linearly in the total number of evaluations (as opposed to GP, which scales cubically). In all cases, Box Random is significantly outperformed by some Box GP or Box ABLR variant, demonstrating that additional gains are achievable by using some of the data from the related problems to tighten the search space (see Figure B1b and Figure B1c in Supplement B). Next, we studied the effect of the number n_t of samples from each problem on Ellipsoid Random. Results are reported in Figure 4a. 
We found that with a small number (n_t = 8) of samples per problem, learning the search space already provided sizable gains. We also studied the effects of the number of related problems.3 Results are given in Figure 4b. We see that our transfer learning by reducing the search space performs well even if only 3 previous problems are available, while transferring from 9 problems yields most of the improvements. The results for Box Random are similar (see Figures B2a-B2b in Supplement B).

Finally, we benchmark our slack variable extensions of Box and Ellipsoid from Section 5. As the number of related problems grows, the volume of the smallest box or ellipsoid enclosing all minima may be overly large due to some outlier solutions. For example, we observed that the optimal learning rate η of XGBoost is typically ≤ 0.3, except η ≈ 1 for one data set. Our slack extensions are able to neutralize such outliers, reducing the learned search space and improving the performance (see Figure B3a and Figure B3b in Supplement B for more details).

2 Four features: data set size; number of features; class imbalance; Naive Bayes landmark feature.
3 For each of the 30 target problems, we pick k < 29 of the remaining ones at random, independently in each random repetition. We transfer n_t = 256 samples per problem.

Figure 4: OpenML. 
(a) Sample size complexity and (b) robustness to the number of related problems for Ellipsoid Random.

Figure 5: Feedforward neural network. (a) Performance of BO algorithms and their transfer learning counterparts. (b) Comparison with resource-based BO.

6.3 Tuning neural networks across multiple data sets

The last set of experiments we conduct consists of tuning the HPs of a 2-layer feed-forward neural network on 4 data sets [25], namely {slice localization, protein structure, naval propulsion, parkinsons telemonitoring}. The search space for this neural network contains the initial learning rate, batch size, learning rate schedule, as well as layer-specific widths, dropout rates and activation functions, for a total of 9 HPs. All the HPs are discretized, giving 62,208 HP configurations in total, each of which has been trained with ADAM for 100 epochs on the 4 data sets, optimizing the mean squared error. For each HP configuration, the learning curves (both on training and validation set) and the final test metric are saved and provided publicly by the authors [25]. As a result, we avoided re-evaluating the HPs, which significantly reduced our experiment time.
Each black-box optimization problem consists of tuning the neural network parameters over 1 data set after using 256 evaluations randomly chosen from the remaining 3 data sets to learn the search space. The default search space ranges are provided in [25]. We compared plain Random, SMAC, and GP to their variants based on Box. The results are illustrated in Figure 5a, where significant improvements can be observed.
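The Box construction underlying these transfer variants can be sketched as follows: collect the best configuration found on each related data set and take coordinate-wise bounds. This is a minimal illustration under our own naming and data-layout assumptions, not the authors' code; the quantile arguments are a simple stand-in for the paper's slack-variable extension, which similarly discards outlier optima.

```python
import numpy as np

def learn_box(best_configs, lo_q=0.0, hi_q=1.0):
    """Per-dimension search ranges from the best configuration of each
    related task. With lo_q=0 and hi_q=1 this is the smallest enclosing
    box; tighter quantiles drop outlier optima (a crude stand-in for
    the slack-variable extension)."""
    X = np.asarray(best_configs, dtype=float)   # shape (n_tasks, n_dims)
    return np.quantile(X, lo_q, axis=0), np.quantile(X, hi_q, axis=0)

def sample_uniform(lower, upper, n, seed=None):
    """Random search restricted to the learned box."""
    rng = np.random.default_rng(seed)
    return rng.uniform(lower, upper, size=(n, len(lower)))
```

Box + Random then amounts to running plain random search with `sample_uniform` over the learned ranges instead of the default ones.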
Ellipsoid Random also outperforms classic BO baselines such as GP and SMAC. Finally, Figure 5b demonstrates that HB can be further sped up by our transfer learning extensions. These results also indicate that good solutions are typically found in the interior of the search space.
To see how much accuracy could potentially be lost compared to methods that search over the entire search space, we re-ran the OpenML and neural network experiments using 16 times as many iterations (see Figure C1a and Figure C1b in Supplement C). Empirical evidence shows that excluding the best solution is not a concern in practice, and that restricting the search space leads to considerably faster convergence.

7 Conclusions

We presented a novel, modular approach to induce transfer learning in BO. Rather than designing a specialized multi-task model, our simple method automatically crafts promising search spaces based on previously run experiments.
Over an extensive set of benchmarks, we showed that our approach significantly speeds up the optimization, and can be seamlessly combined with a wide range of existing BO techniques. Beyond those we used in our experiments, we can further mention recent resource-aware optimizers [2, 8], evolutionary-based techniques [18, 34] and virtually any core improvement of BO, be it related to the acquisition function [48] or efficient parallelism [23, 47].
The proposed method could be extended in a model-based fashion, allowing us to simultaneously search for the best HP configurations xt's for each data set, together with compact spaces containing all of these configurations. When evaluation data from a large number of tasks is available, heterogeneity in their minima may be better captured by employing a mixture of boxes or ellipsoids.

References

[1] GPyOpt: A Bayesian optimization framework in Python. https://github.com/SheffieldML/GPyOpt, 2016.

[2] B. Baker, O. Gupta, R. Raskar, and N. Naik. Accelerating neural architecture search using performance prediction. Technical report, preprint arXiv:1705.10823, 2017.

[3] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In Proceedings of the International Conference on Machine Learning (ICML), pages 199–207, 2013.

[4] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, et al. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, volume 24, pages 2546–2554, 2011.

[5] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[6] S. Diamond, E. Chu, and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization, version 0.2. http://cvxpy.org/, May 2014.

[7] Facebook. Ax, adaptive experimentation platform, 2019.

[8] S. Falkner, A. Klein, and F. Hutter.
BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the International Conference on Machine Learning (ICML), pages 1436–1445, 2018.

[9] M. Feurer, B. Letham, and E. Bakshy. Scalable meta-learning for Bayesian optimization using ranking-weighted Gaussian process ensembles. In ICML 2018 AutoML Workshop, July 2018.

[10] M. Feurer, T. Springenberg, and F. Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[11] N. Fusi and H. M. Elibol. Probabilistic matrix factorization for automated machine learning. Technical report, preprint arXiv:1705.05355, 2017.

[12] J. D. Gammell and T. D. Barfoot. The probability density function of a transformation-based hyperellipsoid sampling technique. Technical report, preprint arXiv:1404.1347, 2014.

[13] J. Gardner, M. Kusner, Z. Xu, K. Weinberger, and J. Cunningham. Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 937–945, 2014.

[14] E. C. Garrido-Merchán and D. Hernández-Lobato. Predictive entropy search for multi-objective Bayesian optimization with constraints. Technical report, preprint arXiv:1609.01051, 2016.

[15] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495, 2017.

[16] R. B. Gramacy and H. K. Lee. Optimization under unknown constraints. Technical report, preprint arXiv:1004.4027, 2010.

[17] M. Grötschel, L. Lovász, and A. Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.

[18] N. Hansen. The CMA evolution strategy: A tutorial.
Technical report, preprint arXiv:1604.00772, 2016.

[19] R. Harman and V. Lacko. On decompositional algorithms for uniform sampling from n-spheres and n-balls. Journal of Multivariate Analysis, 101(10):2297–2304, 2010.

[20] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of LION-5, pages 507–523, 2011.

[21] F. John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays presented to R. Courant on his 60th Birthday, pages 187–204, 1948.

[22] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

[23] K. Kandasamy, A. Krishnamurthy, J. Schneider, and B. Poczos. Asynchronous parallel Bayesian optimisation via Thompson sampling. Technical report, preprint arXiv:1705.09236, 2017.

[24] K. Kandasamy, K. R. Vysyaraju, W. Neiswanger, B. Paria, C. R. Collins, J. Schneider, B. Poczos, and E. P. Xing. Tuning hyperparameters without grad students: Scalable and robust Bayesian optimisation with Dragonfly. arXiv preprint arXiv:1903.06694, 2019.

[25] A. Klein and F. Hutter. Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970, 2019.

[26] E. M. Knorr, R. T. Ng, and R. H. Zamar. Robust space transformations for distance-based operations. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 126–135. ACM, 2001.

[27] N. Knudde, J. van der Herten, T. Dhaene, and I. Couckuyt. GPflowOpt: A Bayesian Optimization Library using TensorFlow. arXiv preprint arXiv:1711.03845, 2017.

[28] H. C. L. Law, P. Zhao, J. Huang, and D. Sejdinovic. Hyperparameter learning via distributional transfer. Technical report, preprint arXiv:1810.06305, 2018.

[29] L. Li, K.
Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.

[30] J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2, 1978.

[31] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh. Filtering Bayesian optimization approach in weakly specified search space. Knowledge and Information Systems, pages 1–29, 2018.

[32] V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau. Scalable hyperparameter transfer learning. In Advances in Neural Information Processing Systems (NIPS), 2018.

[33] M. Poloczek, J. Wang, and P. I. Frazier. Warm starting Bayesian optimization. In Winter Simulation Conference (WSC), 2016, pages 770–781. IEEE, 2016.

[34] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. Technical report, preprint arXiv:1802.01548, 2018.

[35] N. Schilling, M. Wistuba, and L. Schmidt-Thieme. Scalable hyperparameter optimization with products of Gaussian process experts. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 33–48. Springer, 2016.

[36] B. Shahriari, A. Bouchard-Côté, and N. de Freitas. Unbounded Bayesian optimization via regularization. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1168–1176, 2016.

[37] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

[38] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2960–2968, 2012.

[39] J.
Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat,\nand R. Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of\nthe International Conference on Machine Learning (ICML), pages 2171\u20132180, 2015.\n\n[40] J. Snoek, K. Swersky, R. Zemel, and R. Adams. Input warping for Bayesian optimization of\nnon-stationary functions. In Proceedings of the International Conference on Machine Learning\n(ICML), pages 1674\u20131682, 2014.\n\n[41] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust\nBayesian neural networks. In Advances in Neural Information Processing Systems (NIPS),\npages 4134\u20134142, 2016.\n\n[42] P. Sun and R. M. Freund. Computation of minimum-volume covering ellipsoids. Operations\n\nResearch, 52(5):690\u2013706, 2004.\n\n[43] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Advances in\n\nNeural Information Processing Systems (NIPS), pages 2004\u20132012, 2013.\n\n[44] L. Valkov, R. Jenatton, F. Winkelmolen, and C. Archambeau. A simple transfer-learning\n\nextension of Hyperband. In NIPS Workshop on Meta-Learning, 2018.\n\n[45] J. N. van Rijn and F. Hutter. Hyperparameter importance across datasets. In Proceedings of the\n24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages\n2367\u20132376. ACM, 2018.\n\n[46] J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine\n\nlearning. ACM SIGKDD Explorations Newsletter, 15(2):49\u201360, 2014.\n\n[47] Z. Wang, C. Li, S. Jegelka, and P. Kohli. Batched high-dimensional Bayesian optimization via\n\nstructural kernel learning. Technical report, preprint arXiv:1703.01973, 2017.\n\n[48] J. Wilson, F. Hutter, and M. Deisenroth. Maximizing acquisition functions for Bayesian\noptimization. In Advances in Neural Information Processing Systems (NIPS), pages 9906\u20139917,\n2018.\n\n[49] M. Wistuba, N. Schilling, and L. 
Schmidt-Thieme. Hyperparameter search space pruning: a new component for sequential model-based hyperparameter optimization. In Machine Learning and Knowledge Discovery in Databases, pages 104–119. Springer, 2015.

[50] M. Wistuba, N. Schilling, and L. Schmidt-Thieme. Learning hyperparameter optimization initializations. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. IEEE, 2015.

[51] D. Yogatama and G. Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1077–1085, 2014.