{"title": "Hyperparameter Learning via Distributional Transfer", "book": "Advances in Neural Information Processing Systems", "page_first": 6804, "page_last": 6815, "abstract": "Bayesian optimisation is a popular technique for hyperparameter learning but typically requires initial exploration even in cases where similar prior tasks have been solved. We propose to transfer information across tasks using learnt representations of training datasets used in those tasks. This results in a joint Gaussian process model on hyperparameters and data representations. Representations make use of the framework of distribution embeddings into reproducing kernel Hilbert spaces. The developed method has a faster convergence compared to existing baselines, in some cases requiring only a few evaluations of the target objective.", "full_text": "Hyperparameter Learning via Distributional Transfer\n\nHo Chung Leon Law\u21e4\nUniversity of Oxford\n\nho.law@stats.ox.ac.uk\n\nPeilin Zhao\u21e4\nTencent AI Lab\n\nmasonzhao@tencent.com\n\nLucian Chan\n\nUniversity of Oxford\n\nleung.chan@stats.ox.ac.uk\n\nJunzhou Huang\nTencent AI Lab\n\njoehhuang@tencent.com\n\nDino Sejdinovic\u21e4\nUniversity of Oxford\n\ndino.sejdinovic@stats.ox.ac.uk\n\nAbstract\n\nBayesian optimisation is a popular technique for hyperparameter learning but typi-\ncally requires initial exploration even in cases where similar prior tasks have been\nsolved. We propose to transfer information across tasks using learnt representations\nof training datasets used in those tasks. This results in a joint Gaussian process\nmodel on hyperparameters and data representations. 
Representations make use of the framework of distribution embeddings into reproducing kernel Hilbert spaces. The developed method has a faster convergence compared to existing baselines, in some cases requiring only a few evaluations of the target objective.\n\n1 Introduction\n\nHyperparameter selection is an essential part of training a machine learning model, and a judicious choice of values of hyperparameters such as the learning rate, regularisation, or kernel parameters is what often makes the difference between an effective and a useless model. To tackle this challenge in a more principled way, the machine learning community has been increasingly focusing on Bayesian optimisation (BO) [34], a sequential strategy that selects hyperparameters θ based on past evaluations of model performance. In particular, a Gaussian process (GP) [31] prior is used to represent the underlying accuracy f as a function of the hyperparameters θ, whilst different acquisition functions α(θ; f) are proposed to balance between exploration and exploitation. This has been shown to give superior performance compared to traditional methods [34] such as grid search or random search. However, BO suffers from the so-called 'cold start' problem [28, 38]: initial observations of f at different hyperparameters are required to fit a GP model. Various methods [38, 6, 36, 28] were proposed to address this issue by transferring knowledge from previously solved tasks; however, initial random evaluations of the models are still needed to assess the similarity across tasks. This might be prohibitive: evaluations of f can be computationally costly and our goal may be to select hyperparameters and deploy our model as soon as possible. 
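To make the sequential strategy concrete, the following is a minimal sketch of one possible GP-based BO loop with an expected improvement acquisition on a toy one-dimensional objective (our own illustration with made-up settings and objective, not the method developed in this paper):

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls=1.0):
    # squared-exponential kernel on 1-d inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, z, Xs, noise=1e-6):
    # standard GP regression equations via a Cholesky solve
    L = np.linalg.cholesky(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(L.T, np.linalg.solve(L, z))
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v ** 2, axis=0)  # prior variance is rbf(x, x) = 1
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    s = np.sqrt(var)
    g = (mu - best) / s
    return s * (g * norm.cdf(g) + norm.pdf(g))

# toy "accuracy" as a function of a single hyperparameter theta (made up)
f = lambda th: np.exp(-0.5 * (th - 1.3) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=5)   # initial random evaluations: the "cold start"
z = f(X)
grid = np.linspace(-4, 4, 401)
for _ in range(10):              # sequential selection of theta via EI
    mu, var = gp_posterior(X, z, grid)
    t = grid[np.argmax(expected_improvement(mu, var, z.max()))]
    X, z = np.append(X, t), np.append(z, f(t))
best_theta = X[np.argmax(z)]
```

Transfer methods aim to shrink or skip the initial random stage of such a loop.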
We note that treating f as a black-box function, as is often the case in BO, ignores the highly structured nature of hyperparameter learning – it corresponds to training specific models on specific datasets. We make steps towards utilising such structure in order to borrow strength across different tasks and datasets.\n\nContribution. We consider a scenario where a number of tasks have been previously solved, and we propose a new BO algorithm making use of the embeddings of the distribution of the training data [4, 23].\n\n*Corresponding authors\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn particular, we propose a model that jointly models all tasks at once, by considering an extended domain of inputs to the model of accuracy f, namely the distribution of the training data P_XY, the sample size of the training data s, and the hyperparameters θ. Through utilising all seen evaluations from all tasks together with meta-information, our methodology is able to learn a useful representation of the task that enables appropriate transfer of information to new tasks. As part of our contribution, we adapt our modelling approach to recent advances in scalable hyperparameter transfer learning [26] and demonstrate that our proposed methodology can scale linearly in the number of function evaluations. Empirically, across a range of regression and classification tasks, our methodology performs favourably at initialisation and has a faster convergence compared to existing baselines – in some cases, the optimal accuracy is achieved in just a few evaluations.\n\n2 Related Work\n\nThe idea of transferring information from different tasks in the context of hyperparameter learning has been studied in various settings [38, 6, 36, 28, 43, 26]. Amongst this literature, one common feature is that the similarity across tasks is captured only through the evaluations of f. 
This implies that sufficient evaluations from the task of interest are necessary before we can transfer information. This is problematic if model training is computationally expensive and our goal is to deploy our model as quickly as possible. Further, the hyperparameter search for a machine learning model is in general not a black-box function, as we have additional information available: the dataset used in training. In our work, we aim to learn feature representations of training datasets in order to yield good initial hyperparameter candidates without having seen any evaluations from our target task. While the use of such dataset features, called meta-features, has been explored before, the current literature focuses on handcrafted meta-features². These strategies are not optimal, as such meta-features can be very similar across tasks that have very different objectives f, and vice versa. In fact, a study on OpenML [40] meta-features has shown that the optimal set depends on the algorithm and the data [39]. This suggests that reliance on these features can have an adverse effect on exploration, and we give an example of this in section 5. To avoid such shortcomings, given the same input space, our algorithm learns meta-features directly from the data, avoiding such potential issues. Although [15] have also previously proposed to learn meta-feature representations (specifically for image data), their methodology requires the same set of hyperparameters to be evaluated for all previous tasks. This is clearly a limitation, considering that different hyperparameter regions will be of interest for different tasks, and we would thus require excessive exploration of all those regions under each task. To utilise meta-features, [15] propose to warm-start Bayesian optimisation [10, 32, 8] by initialising with the best hyperparameters from previous tasks. 
This also might be sub-optimal, as we neglect non-optimal hyperparameters that can still provide valuable information for our new task, as we demonstrate in section 5. Our work is similar in spirit to [17], which considers the sample size s as an additional input, but does not consider different tasks corresponding to different training data distributions.\n\n3 Background\n\nOur goal is to find:\n\nθ*_target = argmax_{θ∈Θ} f^target(θ),\n\nwhere f^target is the target task objective we would like to optimise with respect to hyperparameters θ. In our setting, we assume that there are n (potentially) related source tasks f^i, i = 1, ..., n, and for each f^i, we assume that we have {θ^i_k, z^i_k}_{k=1}^{N_i} from past runs, where z^i_k denotes a noisy evaluation of f^i(θ^i_k) and N_i denotes the number of evaluations of f^i from task i. Here, we focus on the case where f^i(θ) is some standardised accuracy (e.g. test set AUC) of a trained machine learning model with hyperparameters θ and training data D^i = {x^i_ℓ, y^i_ℓ}_{ℓ=1}^{s_i}, where x^i_ℓ ∈ R^p are the covariates, y^i_ℓ are the labels and s_i is the sample size of the training data. For a general framework, D^i is any input to f^i apart from θ (it can be unsupervised) – but following a typical supervised learning treatment, we assume it to be an i.i.d. sample from the joint distribution P_XY. For each task we now have:\n\n(f^i, D^i = {x^i_ℓ, y^i_ℓ}_{ℓ=1}^{s_i}, {θ^i_k, z^i_k}_{k=1}^{N_i}), i = 1, ..., n.\n\n²A comprehensive survey on meta-learning and handcrafted meta-features can be found in [13, Ch.2], [8].\n\nOur strategy now is to measure the similarity between datasets (as a representation of the task itself), in order to transfer information from previous tasks to help us quickly locate θ*_target. 
In order to construct meaningful representations and measure similarity between different tasks, we will make the assumption that x^i_ℓ ∈ X and y^i_ℓ ∈ Y for all i, and that the supervised learning model class is the same throughout. While this setting might seem limiting, there are many examples of practical applications, including ride-sharing, customer analytics models and online inventory systems [6, 28]. In all these cases, as new data becomes available, we might want to either re-train our model or re-fit the parameters of the system to adapt to a specific distributional data input. In section 5.3, we further demonstrate that our methodology is applicable to a real-life protein-ligand binding problem in the area of drug design, which typically requires significant effort to tune hyperparameters of the models for different targets [33].\n\nIntuitively, this assumption implies that the source of the differences between f^i(θ) across i and f^target(θ) lies in the data D^i and D^target. To model this, we will decompose the data D^i into the joint distribution P^i_XY of the training data (D^i = {x^i_ℓ, y^i_ℓ}_{ℓ=1}^{s_i} i.i.d.∼ P^i_XY) and the sample size s_i for task i. Sample size³ is important here, as it is closely related to the choice of model complexity, which is in turn closely related to hyperparameter choice [17]. While we have chosen to model D^i as P^i_XY and s_i, in practice, through simple modifications of the methodology we propose, it is possible to model D^i as a set [44]. Under this setting, we will consider f(θ, P_XY, s), where f is a function of hyperparameters θ, joint distribution P_XY and sample size s. For example, f could be the negative empirical risk, i.e.\n\nf(θ, P_XY, s) = −(1/s) Σ_{ℓ=1}^{s} L(h_θ(x_ℓ), y_ℓ),\n\nwhere L is the loss function and h_θ is the model's predictor. To recover f^i and f^target, we can evaluate f at the corresponding P_XY and s, i.e. f^i(θ) = f(θ, P^i_XY, s_i) and f^target(θ) = f(θ, P^target_XY, s_target). In this form, we can see that, similarly to assuming that f varies smoothly as a function of θ in standard BO, this model also assumes smoothness of f across P_XY as well as across s, following [17]. Here we can see that if two distributions and sample sizes are similar (with respect to a distance on their representations that we will learn), their corresponding values of f will also be similar. In this source and target task setup, this suggests we can selectively utilise information from previous source datasets' evaluations {θ^i_k, z^i_k}_{k=1}^{N_i} to help us model f^target.\n\n4 Methodology\n\n4.1 Embedding of data distributions\n\nTo model P_XY, we will construct ψ(D), a feature map on joint distributions for each task, estimated through the task's training data D. Here, we will follow [4], which considers transfer learning, and make use of kernel mean embeddings to compute feature maps of distributions (cf. [23] for an overview). We begin by considering various feature maps of covariates and labels, denoting them by φ_x(x) ∈ R^a, φ_y(y) ∈ R^b and φ_xy([x, y]) ∈ R^c, where [x, y] denotes the concatenation of covariates x and label y. Depending on the scenario, different quantities will be of interest.\n\nMarginal Distribution P_X. Modelling the marginal distribution P_X is useful, as we might expect various tasks to differ in the distribution of x and hence in the hyperparameters θ, which, for example, may be related to the scales of covariates. We might also find that x is observed with different levels of noise across tasks. In this situation, it is natural to expect that those tasks with more noise would perform better under a simpler, more robust model (e.g. by increasing ℓ2 regularisation in the objective function). 
To embed P_X, we can estimate the kernel mean embedding μ_{P_X} [23] with D by:\n\nψ(D) = μ̂_{P_X} = (1/s) Σ_{ℓ=1}^{s} φ_x(x_ℓ),\n\nwhere ψ(D) ∈ R^a is an estimator of a representation of the marginal distribution P_X.\n\nConditional Distribution P_{Y|X}. Similarly to P_X, we can also embed the conditional distribution P_{Y|X}. This is an important quantity, as the form of the signal can shift across tasks. For example, we might have a latent variable W that controls the smoothness of a function, i.e. P^i_{Y|X} = P_{Y|X,W=w_i}. In a ridge regression setting, we will observe that those tasks (functions) that are less smooth require a smaller bandwidth in order to perform better. For regression, to model the conditional distribution, we will use the kernel conditional mean operator C_{Y|X} [35], estimated with D by:\n\nĈ_{Y|X} = Φ_y^⊤ (Φ_x Φ_x^⊤ + λI)^{−1} Φ_x = λ^{−1} Φ_y^⊤ (I − Φ_x (λI + Φ_x^⊤ Φ_x)^{−1} Φ_x^⊤) Φ_x,\n\nwhere Φ_x = [φ_x(x_1), ..., φ_x(x_s)]^⊤ ∈ R^{s×a}, Φ_y = [φ_y(y_1), ..., φ_y(y_s)]^⊤ ∈ R^{s×b} and λ is a regularisation parameter that we learn. It should be noted that the second equality [31] allows us to avoid the O(s³) cost arising from the s × s inverse. This is important, as the number of samples s per task can be large. As Ĉ_{Y|X} ∈ R^{b×a}, we will flatten it to obtain ψ(D) ∈ R^{ab} as a representation of P_{Y|X}. In practice, as we rarely have prior insight into which quantity is useful for transferring hyperparameter information, we will model both the marginal and conditional distributions together by concatenating the two feature maps above. The advantage of such an approach is that the learning algorithm does not itself have to decouple the overall representation of the training dataset into the information about the marginal and conditional distributions which is likely to be informative.\n\n³Following [17], in practice we re-scale s to [0, 1], so that the task with the largest sample size has s = 1.\n\nJoint Distribution P_XY. 
Taking an alternative and more simplistic approach, it is also possible to model the joint distribution P_XY directly. One approach is to compute the kernel mean embedding based on concatenated samples [x, y], using the feature map φ_xy. Alternatively, we can embed P_XY using the cross-covariance operator C_XY [11], estimated with D by:\n\nĈ_XY = (1/s) Σ_{ℓ=1}^{s} φ_x(x_ℓ) ⊗ φ_y(y_ℓ) = (1/s) Φ_x^⊤ Φ_y ∈ R^{a×b},\n\nwhere ⊗ denotes the outer product; similarly to Ĉ_{Y|X}, we flatten it to obtain ψ(D) ∈ R^{ab}.\n\nAn important choice when modelling these quantities is the form of the feature maps φ_x, φ_y and φ_xy, as these define the corresponding features of the data distribution we would like to capture. For example, φ_x(x) = x and φ_x(x) = xx^⊤ would capture the respective mean and second moment of the marginal distribution P_X. However, instead of defining a fixed feature map, here we opt for a flexible representation, specifically in the form of neural networks (NN) for φ_x, φ_y and φ_xy (except φ_y for classification⁴), in a similar fashion to [42]. To provide better intuition for this choice, suppose we have two tasks i, j and that P^i_XY ≈ P^j_XY (with the same sample size s). This implies that f^i ≈ f^j, and hence θ*_i ≈ θ*_j. However, the converse does not hold in general: f^i ≈ f^j does not necessarily imply P^i_XY ≈ P^j_XY. For example, regularisation hyperparameters of a standard machine learning model are likely to be robust to rotations and orthogonal transformations of the covariates (leading to a different P_X). 
Hence, it is important to define a versatile model for ψ(D), which can yield representations invariant to variations in the training data that are irrelevant for hyperparameter choice.\n\n4.2 Modelling f\n\nGiven ψ(D), we will now construct a model for f(θ, P_XY, s), given observations {{(θ^i_k, P^i_XY, s_i), z^i_k}_{k=1}^{N_i}}_{i=1}^{n}, along with any observations on the target. Note that we will interchangeably use the notation f to denote the model and the underlying function of interest. We now focus on the algorithms distGP and distBLR, with additional details in Appendix A.\n\nGaussian Processes (distGP). We proceed similarly to standard BO [34], using a GP to model f and a normal likelihood (with variance σ² across all tasks⁵) for our observations z:\n\nf ∼ GP(μ, C),    z|ξ ∼ N(f(ξ), σ²),\n\nwhere μ is a constant, C is the corresponding covariance function on (θ, P_XY, s), and ξ is a particular instance of an input. In order to fit a GP with inputs (θ, P_XY, s), we use the following C:\n\nC({θ_1, P^1_XY, s_1}, {θ_2, P^2_XY, s_2}) = ν k_θ(θ_1, θ_2) k_p([ψ(D^1), s_1], [ψ(D^2), s_2]),\n\nwhere ν is a constant, and k_θ and k_p are standard Matérn-3/2 kernels (with separate bandwidths across the dimensions). For classification, we additionally concatenate the class size ratio per class, as this is not captured in ψ(D^i). Utilising {{(θ^i_k, P^i_XY, s_i), z^i_k}_{k=1}^{N_i}}_{i=1}^{n}, we can optimise μ, ν, σ² and any parameters in ψ(D), k_θ and k_p using the marginal likelihood of the GP (in an end-to-end fashion).\n\n⁴For classification, we use Ĉ_XY and a one-hot encoding for y, implying a marginal embedding per class.\n⁵For different noise levels across tasks, we can allow for a different σ²_i per task i in distGP and distBLR.\n\nBayesian Linear Regression (distBLR). 
While GPs, with their well-calibrated uncertainties, have shown superior performance in BO [34], it is well known that they suffer from O(N³) computational complexity [31], where N is the total number of observations. In this case, as N = Σ_{i=1}^{n} N_i, we might find that the total number of evaluations across all tasks is too large for GP inference to be tractable, or that the computational burden of GPs outweighs the cost of computing f in the first place. To overcome this problem, we will follow [26] and use Bayesian linear regression (BLR), which scales linearly in the number of observations, with the model given by\n\nw ∼ N(0, αI),    z|w ∼ N(Φw, σ²I),\n\nΦ = [φ([θ^1_1, ψ_1]), ..., φ([θ^1_{N_1}, ψ_1]), ..., φ([θ^n_1, ψ_n]), ..., φ([θ^n_{N_n}, ψ_n])]^⊤ ∈ R^{N×d},    ψ_i = [ψ(D^i), s_i],\n\nwhere α > 0 denotes the prior regularisation, and [·,·] denotes concatenation. Here φ denotes a feature map on the concatenated hyperparameters θ, data embedding ψ(D) and sample size s. Following [26], we also employ a neural network for φ. While conceptually similar to [26], who fit a BLR per task, here we consider a single BLR fitted jointly on all tasks, highlighting differences across tasks using the available meta-information. The advantage of our approach is that, for a given new task, we are able to directly utilise all previous information and one-shot predict hyperparameters without seeing any evaluations from the target task. This is especially important when our goal might be to deploy our system with only a few evaluations from our target task. In addition, a separately trained target task BLR is likely to be poorly fitted given only a few evaluations. 
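As a concrete sketch of the linear scaling, the following fits a single Bayesian linear regression on evaluations pooled across tasks, with a frozen random feature map standing in for the learned network φ (all data and settings here are synthetic placeholders, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

# frozen random single-layer map standing in for the learned phi([theta, psi, s])
d_in, d = 4, 50
W, b0 = rng.normal(size=(d_in, d)), rng.normal(size=d)
phi = lambda U: np.tanh(U @ W + b0)

# N pooled evaluations; each row plays the role of [theta, psi(D_i), s_i]
N = 300
U = rng.normal(size=(N, d_in))
z = np.sin(U[:, 0]) + 0.1 * rng.normal(size=N)  # synthetic noisy "accuracies"

alpha, sigma2 = 1.0, 0.01
Phi = phi(U)                                    # N x d design matrix
A = Phi.T @ Phi / sigma2 + np.eye(d) / alpha    # d x d posterior precision
m = np.linalg.solve(A, Phi.T @ z / sigma2)      # posterior mean of weights

# predictive mean and variance at new inputs: overall cost is linear in N
Us = rng.normal(size=(5, d_in))
Ps = phi(Us)
pred_mean = Ps @ m
pred_var = sigma2 + np.einsum('nd,dk,nk->n', Ps, np.linalg.inv(A), Ps)
```

Only the d × d precision matrix is ever inverted, which is what makes pooling many tasks' evaluations affordable.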
Similarly to the GP case, we can optimise α, σ² and any unknown parameters in ψ(D) and φ([θ, ψ]) using the marginal likelihood of the BLR.\n\n4.3 Hyperparameter learning\n\nHaving constructed a model for f and optimised any unknown parameters through the marginal likelihood, in order to construct a model for f^target we let f^target(θ) = f(θ, P^target_XY, s_target). Now, to propose the next θ_target to evaluate, we can simply proceed with Bayesian optimisation on f^target, i.e. maximise the corresponding acquisition function α(θ; f^target). While we adopt standard BO techniques and acquisition functions here, note that the generality of the developed framework allows it to be readily combined with many advances in the BO literature, e.g. [12, 24, 19, 34, 41].\n\nAcquisition Functions. For the form of the acquisition function α(θ; f^target), we will use the popular expected improvement (EI) [22]. However, for the first iteration, EI is not appropriate in our context, as such acquisition functions can favour θs with high uncertainty. Recalling that our goal is to quickly select 'good' hyperparameters θ with few evaluations, for the first iteration we will maximise the lower confidence bound (LCB)⁶, as we want to penalise uncertainty and exploit our knowledge from the source tasks' evaluations. While this approach works well in the GP case, for BLR we will use the LCB restricted to the best hyperparameters from previous tasks, as BLR with a NN feature map does not extrapolate as well as GPs in the first iteration. For the exact forms of these acquisition functions, implementation and alternative warm-starting approaches, please refer to Appendix A.3.\n\nOptimisation. 
We make use of ADAM [16] to maximise the marginal likelihood until convergence. To ensure relative comparisons, we standardised each task's dataset features to have mean 0 and variance 1 (except for the unsupervised toy example), with regression labels normalised individually to be in [0, 1]. As the sample size s_i per task is likely to be large, instead of using the full set of s_i samples to compute ψ(D^i), we use a different random sub-sample of batch size b for each iteration of optimisation (i.e. gradients are stochastic). In practice, this parameter b depends on the number of tasks and the evaluation cost of f. It should be noted that a smaller batch size b still provides an unbiased estimate of ψ(D^i). At testing time, it is also possible to use a sub-sample of the dataset to avoid any computational costs arising from a large Σ_i s_i. When retraining, we initialise from the previous set of parameters, hence few gradient steps are required before convergence occurs.\n\n⁶Note this is not the upper confidence bound, as we want to exploit and obtain a good starting initialisation.\n\nFigure 1: Unsupervised toy task over 30 runs. Left: Mean of the maximum observed f^target so far (including any initialisation). Right: Mean of the similarity measure k_p(ψ(D^i), ψ(D^target)) for distGP. For clarity, the legend only shows the μ_i for the 3 source tasks that are similar to the target task with μ_i = 0.25. The remaining source tasks have μ_i ≈ 4.\n\nExtension to other data structures. Throughout the paper, we focus on examples with x ∈ R^p. However, our formulation is more general, as we only require the corresponding feature maps to be defined on individual covariates and labels. 
For example, image data can be modelled by taking φ_x(x) to be a representation given by a convolutional neural network (CNN)⁷, while for text data, we might construct features using Word2vec [21], and then retrain these representations for the hyperparameter learning setting. More broadly, we can initialise ψ(D) to any meaningful representation of the data believed to be useful for the selection of θ*_target. Of course, we can also choose ψ(D) simply as a selection of handcrafted meta-features [13, Ch. 2], in which case our methodology would use these representations to measure similarity between tasks, while performing feature selection [39]. In practice, learned feature maps via kernel mean embeddings can be used in conjunction with handcrafted meta-features, letting the data speak for itself. In Appendix B.1, we provide the selection of 13 handcrafted meta-features that we employ as baselines in the experiments below.\n\n5 Experiments\n\nWe will denote our methodology distBO, with BO being a placeholder for the GP and BLR versions. For φ_x and φ_y we use a single hidden layer NN with tanh activation (with 20 hidden and 10 output units), except for classification tasks, where we use a one-hot encoding for y. We further investigate this choice of NN structure in Appendix C.6 for the Protein dataset (results are fairly robust). For clarity, we will focus on the approach where we separately embed the marginal and conditional distributions before concatenation. Additional results for embedding the joint distribution can be found in Appendix C.1. For BLR, we follow [26] and take the feature map φ to be a NN with three 50-unit layers and tanh activation.\n\nFor baselines, we will consider: 1) manualBO, with ψ(D) as the selection of 13 handcrafted meta-features; 2) multiBO, i.e. 
multiGP [38] and multiBLR [26], where no meta-information is used, i.e. a task is simply encoded by its index (these are initialised with 1 random iteration); 3) initBO [8], plain Bayesian optimisation warm-started with the top 3 hyperparameters from the three most similar source tasks, computing the similarity with the ℓ2 distance on handcrafted meta-features; 4) noneBO, denoting plain Bayesian optimisation [34] with no previous task information; 5) RS, denoting random search. In all cases, both GP and BLR versions are considered.\n\nWe use TensorFlow [1] for implementation, repeating each experiment 30 times, either through re-sampling (toy) or re-splitting the train/test partition (real-life data). For testing, we use the same number of samples s_i for toy data, while using a 60-40 train-test split for real data. We take the embedding batch size⁸ b = 1000, and the learning rate for ADAM to be 0.005. To obtain {θ^i_k, z^i_k}_{k=1}^{N_i} for source task i, we use noneGP to simulate a realistic scenario. Additional details on these baselines\n\n⁷This is similar to [18], who embed distributions of images using a pre-trained CNN for distribution regression.\n⁸Training time is less than 2 minutes on a standard 2.60GHz single-core CPU in all experiments.\n\nFigure 2: Mean of the similarity measure k_p(ψ(D^i), ψ(D^target)) over 30 runs versus number of iterations for the unsupervised toy task. For clarity, the legend only shows the μ_i for the 3 source tasks that are similar to the target task with μ_i = 0.25. The remaining source tasks have μ_i ≈ 4. 
Left: distGP. Middle: manualGP. Right: multiGP.\n\nand implementation can be found in Appendix B and C, with additional toy (non-similar source tasks scenario) and real-life (Parkinson's dataset) experiments in Appendix C.4 and C.5.\n\n5.1 Toy example.\n\nTo understand the various characteristics of the different methodologies, we first consider an "unsupervised" toy 1-dimensional example, where the dataset D^i follows the generative process, for some fixed γ_i: μ_i ∼ N(γ_i, 1); x^i_ℓ | μ_i i.i.d.∼ N(μ_i, 1). We can think of μ_i as the (unobserved) relevant property varying across tasks, and the unlabelled dataset as D^i = {x^i_ℓ}_{ℓ=1}^{s_i}. Here, we consider the objective f given by:\n\nf(θ; D^i) = exp(−(θ − (1/s_i) Σ_{ℓ=1}^{s_i} x^i_ℓ)² / 2),\n\nwhere θ ∈ [−8, 8] plays the role of a 'hyperparameter' that we would like to select. Here, the optimal choice for task i is θ = (1/s_i) Σ_{ℓ=1}^{s_i} x^i_ℓ, and hence it varies together with the underlying mean μ_i of the sampling distribution. An illustration of this experiment can be found in Figure 7 in Appendix C.2.\n\nWe now perform an experiment with n = 15 and s_i = 500 for all i, and generate 3 source tasks with γ_i = 0 and 12 source tasks with γ_i = 4. In addition, we generate an additional target dataset with γ_target = 0, and let the number of source evaluations per task be N_i = 30.\n\nThe results can be found in Figure 1. Here, we observe that distBO has correctly learnt to utilise the appropriate source tasks, and it is able to few-shot the optimum. This is also evident on the right of Figure 1, which shows the similarity measure k_p(ψ(D^i), ψ(D^target)) ∈ [0, 1] for distGP. The feature representation has correctly learned to place high similarity on the three source datasets sharing the same γ_i, and hence having similar values of μ_i, while placing low similarity on the other source datasets. 
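The toy setup above can be reproduced in a few lines; a sketch of the data-generating mechanism and objective follows (our own code; the objective is written so that its maximiser is the empirical mean of D_i, as described in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(gamma, s=500):
    # mu_i ~ N(gamma, 1), then x_l | mu_i ~ N(mu_i, 1)
    mu = rng.normal(gamma, 1.0)
    return mu, rng.normal(mu, 1.0, size=s)

def objective(theta, D):
    # f(theta; D_i) = exp(-(theta - mean(D_i))^2 / 2), maximised at the sample mean
    return np.exp(-0.5 * (theta - D.mean()) ** 2)

# 3 source tasks with gamma_i = 0, 12 with gamma_i = 4, plus a target with gamma = 0
tasks = [make_task(0.0) for _ in range(3)] + [make_task(4.0) for _ in range(12)]
mu_t, D_t = make_task(0.0)

grid = np.linspace(-8, 8, 1601)
theta_star = grid[np.argmax(objective(grid, D_t))]
```

Only the three tasks drawn with gamma = 0 tend to have means near the target's, which is what a useful similarity measure should pick up.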
As expected, manualBO also few-shots the optimum here, since the mean meta-feature, which directly reveals the optimal hyperparameter, was explicitly encoded in the handcrafted ones. initBO starts reasonably well but converges slowly, since the optimal hyperparameters even in the similar source tasks are not the same as that of the target task. It is also notable that multiBO is unable to few-shot the optimum, as it does not make use of any meta-information, hence needing initialisations from the target task to even begin learning the similarity across tasks. This is especially highlighted in Figure 2, which shows an incorrect similarity in the first few iterations. Significance is shown in the mean rank graph found in Figure 8 in Appendix C.2.\n\n5.2 When handcrafted meta-features fail.\n\nWe now demonstrate an example in which using handcrafted meta-features does not capture any information about the optimal hyperparameters of the target task. Consider the following process for\n\nFigure 3: Handcrafted meta-features counterexample over 30 runs, with 50 iterations. Left: Mean of the maximum observed f^target so far (including any initialisation). Right: Mean of the similarity measure k_p(ψ(D^i), ψ(D^target)) for distGP; the target task uses the same generative process as i = 2.\n\ndataset i with x^i_ℓ ∈ R^6 and y^i_ℓ ∈ R, given by:\n\n[x^i_ℓ]_j i.i.d.∼ N(0, 2²), j = 1, ..., 6,\n[x^i_ℓ]_{i+2} ← sign([x^i_ℓ]_1 [x^i_ℓ]_2) |[x^i_ℓ]_{i+2}|,\ny^i_ℓ = log(1 + (Π_{j∈{1,2,i+2}} [x^i_ℓ]_j)³) + ε^i_ℓ,    (1)\n\nwhere ε^i_ℓ i.i.d.∼ N(0, 0.5²), with indices i, ℓ, j denoting task, sample and dimension, respectively: i = 1, ..., 4 and ℓ = 1, ..., s_i, with sample size s_i = 5000. Thus, across n = 4 source tasks, we have constructed regression problems where the dimensions which are relevant (namely 1, 2 and i + 2) are varying. 
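A sketch of this generative process (our own code; we read (1) as re-signing the magnitude of the (i+2)-th coordinate by the sign of the product of the first two, which keeps each dimension marginally normal and all pairs uncorrelated while making the label depend on a three-way interaction):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(i, s=5000):
    # task i in {1, ..., 4}; the paper's dimension i+2 is column i+1 when 0-indexed
    X = rng.normal(0.0, 2.0, size=(s, 6))
    k = i + 1
    X[:, k] = np.sign(X[:, 0] * X[:, 1]) * np.abs(X[:, k])  # three-way sign coupling
    y = np.log1p((X[:, 0] * X[:, 1] * X[:, k]) ** 3) + rng.normal(0.0, 0.5, size=s)
    return X, y

X, y = make_dataset(2)
pairwise = np.abs(np.corrcoef(X, rowvar=False) - np.eye(6)).max()  # near zero
threeway = np.corrcoef(X[:, 0] * X[:, 1], X[:, 3])[0, 1]           # clearly positive
```

Any meta-feature built from one- or two-dimensional statistics sees essentially identical datasets across tasks, even though the relevant dimension differs.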
Note that (1) introduces a three-variable interaction in the relevant dimensions, but all dimensions remain pairwise independent and identically distributed. Thus, while these tasks are inherently different, this difference is invisible when considering the marginal distribution of the covariates and their pairwise relationships, such as covariances. As the handcrafted meta-features for manualBO only consider statistics which process one or two dimensions at a time, or landmarkers [27], their corresponding ψ(D^i) are invariant across tasks up to sampling variation. For an in-depth discussion, see Appendix C.3. We now generate an additional target dataset using the same generative process as i = 2, and let f be the coefficient of determination (R²) on the test set resulting from an automatic relevance determination (ARD) kernel ridge regression with hyperparameters α and γ_1, ..., γ_6. Here α denotes the regularisation parameter, while γ_j denotes the kernel bandwidth for dimension j.\n\nSetting N_i = 125, the results can be found in Figure 3 (GP) and Figure 9 in Appendix C.3 (BLR). It is clear that while distBO is able to learn a high similarity to the correct source task (as shown in Figure 3) and one-shot the optimum, this is not the case for any of the other baselines (Figure 10 in Appendix C.3). In fact, as manualBO's meta-features do not include any useful meta-information, they essentially encode the task index, and hence perform similarly to multiBO. Further, we observe that initBO converges slowly after warm-starting. This is not surprising, as initBO has to 're-explore' the hyperparameter space since it only uses a subset of previous evaluations. This highlights the importance of using all evaluations from all source tasks, even if they are sub-optimal. 
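The objective used in this experiment, the test-set R² of an ARD kernel ridge regression, can be sketched as follows (synthetic data and bandwidth values are ours, purely to illustrate the per-dimension bandwidth mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)

def ard_rbf(A, B, bw):
    # ARD RBF kernel: one bandwidth per input dimension
    D = (A[:, None, :] - B[None, :, :]) / bw
    return np.exp(-0.5 * np.sum(D ** 2, axis=-1))

def krr_fit_predict(Xtr, ytr, Xte, alpha, bw):
    K = ard_rbf(Xtr, Xtr, bw)
    c = np.linalg.solve(K + alpha * np.eye(len(Xtr)), ytr)
    return ard_rbf(Xte, Xtr, bw) @ c

# illustrative data: only dimension 0 matters; a small bandwidth makes a
# dimension influential, a very large one effectively switches it off
X = rng.normal(size=(200, 6))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=200)
Xte = rng.normal(size=(100, 6))
yte = np.sin(2 * Xte[:, 0])

bw = np.array([0.5, 1e3, 1e3, 1e3, 1e3, 1e3])  # declare only dim 0 relevant
pred = krr_fit_predict(X, y, Xte, alpha=0.1, bw=bw)
r2 = 1 - np.mean((pred - yte) ** 2) / np.var(yte)
```

The hyperparameter search over (alpha, bw) is exactly what the BO methods compared here must carry out per task.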
In Figure 9 in Appendix C.3, we show significance using a mean rank graph, and that the BLR methods perform similarly to their GP counterparts.

Figure 4: Maximum observed accuracy rate at each evaluation, averaged over 140 runs, with 20 runs using each protein as the target. Left: Jaccard kernel C-SVM. Right: random forest.

5.3 Classification: Protein dataset.
We now apply the methodologies to a real-life protein-ligand binding problem in the area of drug design. In particular, the Protein dataset consists of 7 different proteins extracted from [9]: ADAM17, AKT1, BRAF, COX1, FXA, GR, VEGFR2. Each protein dataset contains between 1037 and 4434 molecules (data points $s_i$), where each molecule has binary features $x^i_\ell \in \mathbb{R}^{166}$ computed using a chemical fingerprint (MACCS keys⁹). The label per molecule, in $\{0, 1\}$, indicates whether the molecule can bind to the protein target. In this experiment, we can treat each protein as a separate classification task. We consider two classification methods: a Jaccard kernel C-SVM [5, 30] (commonly used for binary data, with hyperparameter $C$), and a random forest (with hyperparameters n_trees, max_depth, min_samples_split, min_samples_leaf), with the corresponding objective $f$ given by the accuracy rate on the test set. In this experiment, we designate each protein in turn as the target task, while using the other $n = 6$ proteins as source tasks. In particular, we take $N_i = 20$ and hence $N = 120$. The results obtained by averaging over different proteins as the target task (20 runs per task) are shown in Figure 4 (with mean rank graphs and the BLR version to be found in Figures 14 and 15 in Appendix C.6). On this dataset, we observe that distGP outperforms its counterpart baselines and few-shots the optimum for both algorithms.
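The Jaccard kernel used for the C-SVM above can be sketched as follows (the helper name and the random stand-in fingerprints are our own; scikit-learn's `SVC` accepts such a callable kernel directly):

```python
import numpy as np
from sklearn.svm import SVC

def jaccard_kernel(A, B):
    """Jaccard (Tanimoto) kernel between binary feature matrices:
    K[i, j] = |a_i AND b_j| / |a_i OR b_j|, positive definite [5]."""
    inter = A @ B.T  # pairwise intersection sizes
    union = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - inter
    # Two all-zero fingerprints are defined to have similarity 1.
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0)

# Hypothetical binary fingerprints standing in for the 166-bit MACCS keys.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 166)).astype(float)
y = rng.integers(0, 2, size=200)
clf = SVC(C=1.0, kernel=jaccard_kernel).fit(X, y)
```

Here the hyperparameter $C$ tuned by the Bayesian optimisation procedure is the usual SVM regularisation constant passed to `SVC`.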
In addition, we can see slower convergence for multiGP and initGP, demonstrating the usefulness of meta-information in this context.

6 Conclusion

We demonstrated that it is possible to borrow strength between multiple hyperparameter learning tasks by making use of the similarity between training datasets used in those tasks. This helped us to develop a method which finds a favourable setting of hyperparameters in only a few evaluations of the target objective. We argue that model performance should not be treated as a black-box function, as it corresponds to specific known models and specific datasets. We demonstrate that its careful consideration as a function of all its inputs, and not just of its hyperparameters, can lead to useful algorithms.

7 Acknowledgements

We thank Kaspar Martens, Jin Xu, Wittawat Jitkrittum and Jean-Francois Ton for useful discussions. HCLL is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). DS is supported in part by the ERC (FP7/617071) and by The Alan Turing Institute (EP/N510129/1). HCLL partially completed this work at Tencent AI Lab, and HCLL and DS are supported in part by the Oxford-Tencent Collaboration on Large Scale Machine Learning.

⁹ http://rdkit.org/docs/source/rdkit.Chem.MACCSkeys.html

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2] Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. Collaborative hyperparameter tuning. In International Conference on Machine Learning, pages 199–207, 2013.

[3] C.M. Bishop. Pattern recognition and machine learning.
Springer New York, 2006.

[4] Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. arXiv preprint arXiv:1711.07910, 2017.

[5] Mathieu Bouchard, Anne-Laure Jousselme, and Pierre-Emmanuel Doré. A proof for the positive definiteness of the Jaccard index matrix. International Journal of Approximate Reasoning, 54(5):615–626, 2013.

[6] Matthias Feurer, Benjamin Letham, and Eytan Bakshy. Scalable meta-learning for Bayesian optimization using ranking-weighted Gaussian process ensembles. In AutoML Workshop at ICML, 2018.

[7] Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Using meta-learning to initialize Bayesian optimization of hyperparameters. In Proceedings of the 2014 International Conference on Meta-learning and Algorithm Selection - Volume 1201, pages 3–10. Citeseer, 2014.

[8] Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[9] Anna Gaulton, Anne Hersey, Michał Nowotka, A Patrícia Bento, Jon Chambers, David Mendez, Prudence Mutowo, Francis Atkinson, Louisa J Bellis, Elena Cibrián-Uhalte, et al. The ChEMBL database in 2017. Nucleic Acids Research, 45(D1):D945–D954, 2016.

[10] Taciana AF Gomes, Ricardo BC Prudêncio, Carlos Soares, André LD Rossi, and André Carvalho. Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1):3–13, 2012.

[11] Arthur Gretton. Notes on mean embeddings and covariance operators. 2015.

[12] José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, Cambridge, MA, USA, 2014.
MIT Press.

[13] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors. Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2019.

[14] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. [Online; accessed ].

[15] Jungtaek Kim, Saehoon Kim, and Seungjin Choi. Learning to transfer initializations for Bayesian hyperparameter optimization. arXiv preprint arXiv:1710.06219, 2017.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079, 2016.

[18] Ho Chung Leon Law, Dougal Sutherland, Dino Sejdinovic, and Seth Flaxman. Bayesian approaches to distribution regression. In International Conference on Artificial Intelligence and Statistics, pages 1167–1176, 2018.

[19] Mark McLeod, Michael A. Osborne, and Stephen J. Roberts. Optimization, fast and slow: optimally switching between local and Bayesian optimization. In Proceedings of the International Conference on Machine Learning (ICML), May 2018.

[20] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine learning, neural and statistical classification. 1994.

[21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[22] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.

[23] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond.
Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

[24] ChangYong Oh, Efstratios Gavves, and Max Welling. BOCK: Bayesian optimization with cylindrical kernels. arXiv preprint arXiv:1806.01619, 2018.

[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[26] Valerio Perrone, Rodolphe Jenatton, Matthias W Seeger, and Cedric Archambeau. Scalable hyperparameter transfer learning. In Advances in Neural Information Processing Systems, pages 6846–6856, 2018.

[27] Bernhard Pfahringer, Hilan Bensusan, and Christophe G Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 743–750, 2000.

[28] Matthias Poloczek, Jialei Wang, and Peter I Frazier. Warm starting Bayesian optimization. In Proceedings of the 2016 Winter Simulation Conference, pages 770–781. IEEE Press, 2016.

[29] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[30] Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural Networks, 18(8):1093–1110, 2005.

[31] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pages 63–71. Springer, 2004.

[32] Matthias Reif, Faisal Shafait, and Andreas Dengel. Meta-learning for evolutionary parameter optimization of classifiers. Machine Learning, 87(3):357–380, 2012.

[33] Gregory A Ross, Garrett M Morris, and Philip C Biggin. One size does not fit all: the limits of structure-based models in drug discovery.
Journal of Chemical Theory and Computation, 9(9):4266–4274, 2013.

[34] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[35] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. Signal Processing Magazine, IEEE, 30(4):98–111, 2013.

[36] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134–4142, 2016.

[37] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

[38] Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.

[39] Ljupco Todorovski, Pavel Brazdil, and Carlos Soares. Report on the experiments with feature selection in meta-level learning. In Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support, Meta-learning and ILP: Forum for Practical Problem Presentation and Prospective Solutions. Citeseer, 2000.

[40] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013.

[41] Jialei Wang, Scott C Clark, Eric Liu, and Peter I Frazier. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016.

[42] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning.
In Artificial Intelligence and Statistics, pages 370–378, 2016.

[43] Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. Scalable Gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning, 107(1):43–78, 2018.

[44] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.