{"title": "Scalable Hyperparameter Transfer Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6845, "page_last": 6855, "abstract": "Bayesian optimization (BO) is a model-based approach for gradient-free black-box function optimization, such as hyperparameter optimization. Typically, BO relies on conventional Gaussian process (GP) regression, whose algorithmic complexity is cubic in the number of evaluations. As a result, GP-based BO cannot leverage large numbers of past function evaluations, for example, to warm-start related BO runs. We propose a multi-task adaptive Bayesian linear regression model for transfer learning in BO, whose complexity is linear in the function evaluations: one Bayesian linear regression model is associated to each black-box function optimization problem (or task), while transfer learning is achieved by coupling the models through a shared deep neural net. Experiments show that the neural net learns a representation suitable for warm-starting the black-box optimization problems and that BO runs can be accelerated when the target black-box function (e.g., validation loss) is learned together with other related signals (e.g., training loss). The proposed method was found to be at least one order of magnitude faster than methods recently published in the literature.", "full_text": "Scalable Hyperparameter Transfer Learning\n\nValerio Perrone, Rodolphe Jenatton, Matthias Seeger, C\u00e9dric Archambeau\n\n{vperrone, jenatton, matthis, cedrica}@amazon.com\n\nAmazon\n\nBerlin, Germany\n\nAbstract\n\nBayesian optimization (BO) is a model-based approach for gradient-free black-box\nfunction optimization, such as hyperparameter optimization. Typically, BO relies\non conventional Gaussian process (GP) regression, whose algorithmic complexity\nis cubic in the number of evaluations. As a result, GP-based BO cannot leverage\nlarge numbers of past function evaluations, for example, to warm-start related\nBO runs. 
We propose a multi-task adaptive Bayesian linear regression model for\ntransfer learning in BO, whose complexity is linear in the function evaluations:\none Bayesian linear regression model is associated to each black-box function\noptimization problem (or task), while transfer learning is achieved by coupling\nthe models through a shared deep neural net. Experiments show that the neural\nnet learns a representation suitable for warm-starting the black-box optimization\nproblems and that BO runs can be accelerated when the target black-box function\n(e.g., validation loss) is learned together with other related signals (e.g., training\nloss). The proposed method was found to be at least one order of magnitude faster\nthan competing methods recently published in the literature.\n\n1\n\nIntroduction\n\nBayesian optimization (BO) is a well-established methodology to optimize expensive black-box\nfunctions [1]. It relies on a probabilistic model of an unknown target f (x) one wishes to optimize and\nwhich is repeatedly queried until one runs out of budget (e.g., time). Queries consist in evaluations\nof f at hyperparameter con\ufb01gurations x1, . . . , xn selected according to an explore-exploit trade-off\ncriterion or acquisition function [2, 3, 1]. The hyperparameter con\ufb01guration corresponding to the\nbest query is then returned. One popular approach is to impose a Gaussian process (GP) prior over\nf and, in light of the observed queries f (x1), . . . , f (xn), to compute the posterior GP. The GP\nmodel maintains a posterior mean function and a posterior variance function that are required when\nevaluating the acquisition function for each new query of f.\nDespite their \ufb02exibility and ability to calibrate the predictive uncertainty, standard GPs scale cubically\nwith the number of observations [4]. Hence, they cannot be applied in situations where f has been or\ncan be queried a very large number of times. 
A possible alternative is to consider sparse GPs, which\nscale linearly in the number of observations and quadratically in the number of inducing points [5, 6].\nHowever, tractability requires the number of inducing points to be much smaller than the number of\nobservations, resulting in a severe deterioration of the predictive performance in practice [7].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work, we aim to warm-start BO in the context of hyperparameter optimization (HPO). Our\ngoal is to learn across related black-box optimization problems by transferring information between\nthem, thus leveraging data from previous BO runs. For example, we warm-start the HPO of a given\nclassifier when it is applied to a battery of reference data sets. Earlier work adopting this transfer\nlearning perspective includes [8, 9]. To circumvent the scalability limitation of GPs and enable\ntransfer learning in HPO at scale, we propose falling back to adaptive Bayesian linear regression\n(ABLR) [10], which scales linearly in the number of observations and cubically in the dimension of a\nlearned basis function expansion, hence the name adaptive. Our first contribution is to extend ABLR\nto the multi-task setting. Here, a task denotes a black-box function optimization, associated with its\nown Bayesian linear regression surrogate. These models share an underlying feedforward neural\nnet (NN) that learns a shared basis expansion (or representation) from the HPO data. Our second\ncontribution is to learn this representation while performing BO. This is achieved by integrating out\nthe linear regression weights and learning the remaining parameters of each linear regression and\nNN weights by optimizing the overall log marginal likelihood. 
To this end, we leverage automatic\ndifferentiation operators recently contributed to MXNet [11, 12].\nIt is well-known that BLR can be seen as GP regression with a linear kernel, and the linear scaling\nwith respect to the number of observations has long been established [13]. Most published work\non HPO uses GPs with nonlinear kernels, as they provide more realistic predictive variances away\nfrom the observations [14, 15]. Our simpler and more scalable ABLR models are known to have\nweaknesses in that respect, which are shared by most other scalable GP approximations [7]. A notable\nexception is Deep Kernel Learning (DKL) [16]: a GP approximation with better predictive variance\nproperties. ABLR can be viewed as a special case of DKL that is substantially simpler to implement\nand less costly to run, as it does not involve unrolling iterative linear solvers in a deep NN computation\ngraph, nor specialized numerical interpolation code. While a sparse GP approximation was applied\nto standard BO in [17], we are not aware of previous work applying scalable GP approximations,\nincluding DKL, to transfer learning in HPO.\nThe paper is organized as follows. Section 2 summarizes the BO paradigm and relates our contribu-\ntions to the state-of-the-art. Section 3 introduces our multi-task adaptive Bayesian linear regression,\ndetailing inference in this model and discussing computational properties. We also explain how\nthe model and its attractive computational properties can be exploited for ef\ufb01cient transfer learning.\nSection 4 presents experiments on simulated and real data, reporting favorable comparisons with\nexisting alternatives when leveraging data across auxiliary tasks and signals. 
We conclude with a\ndiscussion of possible extensions in Section 5.\n\n2 Background\nConsider the problem of optimizing a black-box function f (x) : X \u2192 R over a convex set X \u2282 R,\nnamely a function whose analytical form and gradients are unavailable and that can only be queried\nthrough expensive and potentially noisy evaluations. For instance, suppose f (x) is the test error\nassociated to a deep neural network as a function of its hyper-parameters x (e.g., the number of\nlayers, units, and type of activation functions). In this setting, each evaluation of f (x) is typically\nvery expensive as it requires training the neural network model.\nBO is an ef\ufb01cient approach to \ufb01nd x(cid:63) = argminx\u2208X f (x). The idea is to place a surrogate model\nover the target black-box, and update it sequentially by querying f (x) at new points that optimize\nan acquisition function, effectively trading off exploration and exploitation. Let M be a cheaper-\nto-evaluate surrogate model of f (x), and C a set of evaluated candidates. The canonical BO loop\niterates the following steps until some given budget, such as time, is exceeded:\n\n1. Select a candidate xnew \u2208 X that optimizes a given acquisition function based on M and C.\n2. Collect an evaluation ynew at f (xnew).\n3. Update C = C \u222a {(xnew, ynew)}.\n4. Update the surrogate model M based on C.\n5. Update the budget.\n\nA variety of models M have been used to describe the black-box function f (x), with GPs being a\ncommon choice. In the next subsection, we review a set of alternative models that have been proposed\nto either overcome the scalability limits of GPs or extend BO to optimize multiple related black-box\nfunctions.\n\n2.1 Related work\n\nOur work is most closely related to DNGO [18], where a scalable ABLR model is used for BO.\nHowever, this model is limited to a single task with many evaluations, while we aim to transfer\n\n2\n\n\flearning across multiple tasks. 
In addition, we do not use their two-step learning procedure. Namely,\nthe authors \ufb01rst train the NN together with a \ufb01nal deterministic linear regression layer, relying on\nstandard deep learning software. Then, they discard the \ufb01nal layer and replace it with a Bayesian linear\nregression model in order to drive BO. Our empirical results indicate that joint Bayesian learning\nof the ABLR parameters and the underlying NN parameters is bene\ufb01cial, justifying the additional\ncomplexity in our implementation. Our procedure naturally extends to handling heterogeneous signals.\nMoreover, our comparisons with DNGO in Section 4 show that our approach runs signi\ufb01cantly faster.\nMore details on the relationship to DNGO are provided in Section 3.1 and 3.2.\nAnother related model is BOHAMIANN [15]. The authors propose using Bayesian NNs [19] to\nsample from the posterior over f and add task-speci\ufb01c embeddings to the NN inputs to handle\nmulti-task black-box optimization. While allowing for a principled treatment of uncertainties, fully\nBayesian NNs are computationally expensive and their training can be sensitive to the stochastic\ngradient MCMC procedure used to handle the hyperparameters. Our model allows for simpler\ninference and is parameter free, making it more suitable for deployment at a scale in practice. As\nshown in Section 4, it runs much faster than BOHAMIANN.\nAnother line of related research does not rely on NNs. Feurer et al. [20] warm-start HPO with the best\nhyperparameters for the most similar black-box function, where similarity is measured by distance\nbetween the corresponding meta-data. Our multi-task ABLR model learns a useful shared feature\nbasis even in the absence of task meta-data. It is able to draw information from all previous function\nevaluations, without having to restrict itself to the best solution from previous BO runs. 
This is\nsimilar in spirit to previous work [21, 22, 23], where the covariance matrix of a GP is designed to use\nthe entire set of previous evaluations and capture black-box function similarities. Multi-task ABLR\nmakes it possible to fully embrace this idea, as it can leverage orders of magnitude more observations\nthan with a GP-based approach.\nFinally, a number of models have been proposed speci\ufb01cally in the context of transfer learning for\nHPO. Schilling et al. [24] model the interaction between data sets and optimal hyperparameters\nexplicitly with a factorized multilayer perceptron. Since this model cannot represent uncertainties, an\nensemble of 100 multilayer perceptrons is trained to get predictive means and simulate variances.\nGolovin et al. [25] consider transfer learning in the particular setting where a sequence order (e.g.,\ntime) is assumed across the BO runs; we do not require this assumption. A different approach is\ntaken by Wistuba et al. [26, 27]. In the former work, a meta-loss function is minimized to learn\ninitial hyperparameter con\ufb01gurations. However, their method requires manually setting a kernel\nbandwidth to combine the predictive means of the past models, and an ad hoc procedure for the\nuncertainty which ignores the predictive variances of the past models. In the latter work, a two-stage\nsurrogate model is considered: an independent GP is trained for each data set, after which kernel\nregression combines the GPs into an overall surrogate model for BO. The idea of using a mixture of\nGP experts and learning the weights of the ensemble is also proposed in Feurer et al. [28]. 
While the\nresulting models are able to exploit data set similarities, the cubic scaling makes GP-based approaches\nunfeasible with a large number of evaluations.\n\n3 Multi-task Adaptive Bayesian Linear Regression\nConsider T tasks, which consist in the target black-box functions {ft(\u00b7)}T\nt=1 we would like to optimize\nand which are related in some way (e.g., the validation losses of a classi\ufb01cation model learned on\ndifferent data sets). We have evaluated ft(\u00b7) Nt times, resulting in the data Dt = {(xn\nt )}Nt\nn=1,\nalso denoted by Xt \u2208 RNt\u00d7P and yt \u2208 RNt in stacked form. Our joint model for the responses yt\nconsists of two parts. First, we use a shared feature map \u03c6z(x) : RP (cid:55)\u2192 RD. In our main use case,\n\u03c6z(x) is a feedforward NN with D output units, akin the model proposed in [29], and vector z collects\nt )]n \u2208 RNt\u00d7D.\nall its weights and biases. We collect the features in matrices \u03a6t = \u03a6z(Xt) = [\u03c6z(xn\nSecond, we employ separate Bayesian linear regression surrogates that share the feature map \u03c6z(x)\nto model the black-box functions:\n\nt , yn\n\nP (yt|wt, z, \u03b2t) = N (\u03a6twt, \u03b2\u22121\n\nt INt), P (wt|\u03b1t) = N (0, \u03b1\u22121\n\nt ID),\n\nwhere \u03b2t > 0 and \u03b1t > 0 are precision (i.e., inverse variance) parameters. The model adapts\nto the scale and the noise level of the black-box function ft via \u03b2t and \u03b1t, while the underlying\nNN parametrized by a shared vector z learns a representation to transfer information between the\nblack-box functions. Importantly, the weights wt parametrizing the tth Bayesian linear regression\n\n3\n\n\fare treated as latent variables and integrated out, while the remaining parameters \u03b1t, \u03b2t and z are\nlearned. 
The ABLR model can be seen as a NN whose final linear layers are Bayesian in the sense\nthat their weights are integrated out rather than learned, or as a set of Bayesian linear regressions\nwith a shared feature set learned by the NN. Note that Bayesian inference is analytically tractable and\ncomputationally efficient if restricted to the linear regression weights $\\{w_t\\}_{t=1}^T$. Next, we provide\nexpressions for the ABLR predictive probabilities and learning criterion. Detailed derivations can be\nfound in the supplemental material.\n\n3.1 Posterior Inference and Learning\nFixing the NN parameters and the precisions, the posterior distributions $P(w_t | \\mathcal{D}_t)$ over the linear\nregression weights are multivariate Gaussians, whose parameters can be computed analytically [10].\nMoreover, if $\\phi_t^* = \\phi_z(x_t^*)$ is a new input for task t and $f_t^* = w_t^\\top \\phi_t^*$ is the noise-free function value,\nthe predictive distribution $P(f_t^* | x_t^*, \\mathcal{D}_t) = \\int P(f_t^* | x_t^*, w_t) P(w_t | \\mathcal{D}_t) \\, dw_t$ is Gaussian as well. We\nshow in the supplemental material that $P(f_t^* | x_t^*, \\mathcal{D}_t) = \\mathcal{N}(\\mu_t(x_t^*), \\sigma_t^2(x_t^*))$, where\n\n$$\\mu_t(x_t^*) = \\frac{\\beta_t}{\\alpha_t} (\\phi_t^*)^\\top K_t^{-1} \\Phi_t^\\top y_t = \\frac{\\beta_t}{\\alpha_t} e_t^\\top L_t^{-1} \\phi_t^*, \\qquad \\sigma_t^2(x_t^*) = \\frac{1}{\\alpha_t} (\\phi_t^*)^\\top K_t^{-1} \\phi_t^* = \\frac{1}{\\alpha_t} \\|L_t^{-1} \\phi_t^*\\|^2.$$\n\nHere, $K_t = \\beta_t \\alpha_t^{-1} \\Phi_t^\\top \\Phi_t + I_D$, and $L_t$ is its Cholesky factor: $K_t = L_t L_t^\\top$. Moreover,\n$e_t = L_t^{-1} \\Phi_t^\\top y_t$. The predictive mean and the predictive variance drive the BO. Indeed, these\nare required to compute the acquisition function, which is instrumental to identify the most promising\nhyperparameters to evaluate (see [1] for a review and possible acquisition functions).\nA key difference between our treatment of ABLR and DNGO is how the parameters $\\{\\alpha_t, \\beta_t\\}_{t=1}^T$\nand z are learned. In DNGO, the NN weights z and the weights of the final layer $w_1$ (they consider\nT = 1 only) are first learned by stochastic gradient descent. Next, z is fixed while $w_1$ are discarded\nand estimated in subsequent BO rounds [18]. By contrast, we make no difference between BO\nand learning, integrating out the latent weights $w_t$ in either case. The criterion we minimize is the\nnegative log marginal likelihood of multi-task ABLR:\n\n$$\\rho\\big(z, \\{\\alpha_t, \\beta_t\\}_{t=1}^T\\big) = - \\sum_{t=1}^T \\log P(y_t | z, \\alpha_t, \\beta_t),$$\n\nwhere the marginal likelihood associated to task t is given by $P(y_t | z, \\alpha_t, \\beta_t) = \\mathcal{N}(y_t | 0, \\beta_t^{-1} I_{N_t} + \\alpha_t^{-1} \\Phi_t \\Phi_t^\\top)$.\nAs shown in the supplemental material, these quantities can also be expressed in terms\nof the Cholesky factor $L_t$ of $K_t$. Alternatively, when $N_t < D$, we can work with the Cholesky factor\nof $I_{N_t} + \\beta_t \\alpha_t^{-1} \\Phi_t \\Phi_t^\\top \\in \\mathbb{R}^{N_t \\times N_t}$ instead. Hence, we can compute the learning criterion and its\ngradient in $O(\\sum_t \\max(N_t, D) \\min(N_t, D)^2)$.\n\nIn our model, each ABLR could be seen as a GP with shared linear kernel $\\phi_z(x_1)^\\top \\phi_z(x_2)$, parameterized\nby z. Minimizing $\\rho$ is equivalent to learning these \u201ckernel parameters\u201d by empirical Bayes, as\nis routinely done for GPs [4]. By integrating out the linear regression weights, we induce the learned\nfeature map $\\phi_z(x)$ to provide a good representation for covariance and dependencies, not just for\ngood point predictions. By contrast, DNGO jointly learns features and weights of a linear regression\nmodel, hoping that the former give rise to a useful covariance function. 
The results we present in\nSection 4 provide evidence for the superiority of empirical Bayes, at least in the multi-task setting.\n\n3.2 Computational Implications\n\nOur learning procedure comes with an additional complexity compared to the two-step approach of\nDNGO, where the model is trained using standard deep NN software and stochastic gradient descent\n(SGD) on mini-batches. While our learning criterion decouples as a sum over tasks, it does not\ndecouple over the observations within a task: all $N_t$ observations for task t form a single batch. If the\nnumber of tasks T is moderate, our learning problem is best solved by batch optimization. In our\nexperiments, L-BFGS [30] worked well.\nSince Bayesian learning and optimization are grounded in the same principle, we can re-train all\nmodel parameters as part of BO, whenever new evaluations become available for a task. We adopt this\napproach in all our experiments and noted that L-BFGS re-converges in a few steps because parameters\nchange little with each new observation. In situations with a large number of tasks, we could run\nBO on a task t by only updating $(\\alpha_t, \\beta_t)$, not retraining the NN or updating the other parameters\n$\\{\\beta_{t'}, \\alpha_{t'}\\}_{t' \\neq t}$. Full model retraining could then be done offline.\nOur learning criterion cannot be expressed in standard deep NN software. Namely, the evaluation\nof the Bayesian linear regression negative log marginal likelihood requires computations such as\n$K_t \\mapsto L_t$ (Cholesky decomposition) and $(L_t, v) \\mapsto L_t^{-1} v$ (backsubstitution). These have to be\navailable as auto-grad operators and should run on both CPU and GPU, so they can be first-class\ncitizens in a computation graph. We implemented ABLR in MXNet [12], where a range of linear\nalgebra operators have recently been contributed [11]. 
Given these operators, our implementation of\nABLR is remarkably concise, and gradients required for model training and the minimization of the\nacquisition function are obtained automatically.\nFrom a practical point of view, our approach has further advantages over DNGO. First, L-BFGS\nis simpler to use than SGD, as no parameters have to be tuned. This is all the more important in\nthe context of BO, where our system has to work robustly on a wide range of problems without\nmanual intervention. Second, we learn the parameters \u03b1t and \u03b2t separately for each task by empirical\nBayes [31], while such parameters would have to be manually tuned in DNGO. The critical importance\nof this point is highlighted in Section 4.4.\n\n3.3 Transfer Learning Settings\n\nIn our experiments in Section 4, we consider a range of different use cases of BO with ABLR. The\n\ufb01rst use case we are interested in is HPO for a single machine learning model across different data\nsets. In this setting, a task consists in tuning the model on one of the data sets. Our goal is to\nwarm-start HPO, so that a smaller number of evaluations are needed on a new data set, using the logs\nof previous HPO runs. The simplest approach is to learn a common feature basis \u03c6z(x) across tasks,\nwhere each task is assigned to a separate marginal log likelihood term. If meta-features about the data\nare further available [20], we can collect them in a context vector ct, and use a map \u03c6z(x, ct) instead:\nthe \ufb01rst part x of the input is variable, while the second part ct is constant across data for a task.\nAnother use case is applying the ABLR model to a number of different signals (which play the\nrole of tasks now). 
Here, we are interested in speeding up the optimization of one target function\n(e.g., validation loss), by leveraging a number of auxiliary signals (e.g., training cost, training loss\nconsidered at various epochs) which may come as a by-product, or are cheaper to evaluate. Since\nthese different signals can differ widely in scale and noise level, the automatic learning of the scale\nparameter \u03b1t and the noise \u03b2t is vitally important. Note that this set-up is different from a multi-\nobjective scenario, such as the optimization of an average function over multiple tasks as described in\n[21]. Our set-up differs also from [23], since our primary task is \ufb01xed beforehand and we do not seek\nto identify the best source of information at each round.\n\n4 Experimental Evaluation\n\nThe following subsections illustrate the bene\ufb01ts of multi-task ABLR in a variety of settings. In\nSections 4.2 and 4.3, we evaluate its potential to transfer information between tasks de\ufb01ned by,\nrespectively, synthetic data and OpenML data [32]. In Section 4.4, we investigate the transfer learning\nability of ABLR in presence of multiple heterogeneous signals. In either setting, our goal is to\naccelerate BO by leveraging data from the related tasks and signals.\n\n4.1 Experimental Set-up\n\nWe implemented multiple ABLR in GPyOpt [33], with a backend in MXNet [12], using recent linear\nalgebra extensions [11]. The NN that learns the feature map \u03c6z(x) is similar to the one used in [18]. It\nhas three fully connected layers, each with 50 units and tanh activation function. Hence, 50 features\nare fed to the task-speci\ufb01c Bayesian linear regression models. We compare the NN set-up to random\nFourier basis expansions [34], which have been successfully applied to BO [35, 36]. Speci\ufb01cally,\nlet U \u2208 RD\u00d7P and b \u2208 RD be such that U \u223c N (0, I) and {bj}D\nj=1 \u223c U([0, 2\u03c0]). 
For a vector x, the mapping is given by $\\phi_z(x) = \\sqrt{2/D} \\cos(\\sigma^{-1} U x + b)$, where $\\sigma \\in \\mathbb{R}_+$ is the bandwidth of\nthe approximated radial basis function kernel. We refer to this baseline as RKS (\u201crandom kitchen\nsink\u201d) in the remainder of the paper. It has only a single parameter $\\sigma$ to optimize, which we learn\nwith the same L-BFGS code (see Section 3.2). We also compare ABLR-based BO to the standard\nGP-based BO, using GPyOpt. The GP has a Mat\u00e9rn-5/2 covariance kernel and automatic relevance\ndetermination hyperparameters, optimized by empirical Bayes [4].\nIn the experiments, we will consider models with and without transfer learning. All models without\ntransfer are initialized according to GPyOpt default settings, that is, with a set of five evaluations\npicked at random. The models with transfer are initialized with one random evaluation from the\ntarget task. All BO experiments use the expected improvement acquisition function [2].\n\nFigure 1: Results obtained on the parametrized quadratics (lower is better): (a) Transfer learning\ncomparison of ABLR against baselines; (b) Transfer learning comparison of NN-based methods; (c)\nRun time of RKS- versus NN-based ABLR. See Section 4.2 for a discussion.\n\n4.2 Transfer learning across parametrized quadratic functions\n\nWe first consider an artificial set-up with T tasks, each given by a quadratic function defined on\n$\\mathbb{R}^3$: $f_t(x) = \\frac{1}{2} a_{2,t} \\|x\\|_2^2 + a_{1,t} \\mathbf{1}^\\top x + a_{0,t}$, where $(a_{2,t}, a_{1,t}, a_{0,t})$ belongs to $[0.1, 10]^3$. This triplet\ncan be thought of as the context associated to each task t. We generated T = 30 different tasks\nby drawing $(a_{2,t}, a_{1,t}, a_{0,t})$ uniformly at random, and evaluated multi-task ABLR against baseline\nmethods and NN-based methods in a leave-one-task-out fashion. 
Speci\ufb01cally, we optimized each\none of the 30 tasks after warm-starting the optimization with 10 observations drawn uniformly at\nrandom from each of the remaining 29 tasks. In other words, we have for each task Nt = 10 and\nwarm-starting is therefore based on 290 evaluations. This set of observations is drawn once and taken\nthe same for all other transfer learning methods. The results shown in Figure 1a and 1b are aggregates\nover 10 random repetitions of 30 leave-one-task-out runs.1\nIn Figure 1a, we compare single-task ABLR and standard GP driven HPO with their transfer learning\ncounterparts. Transfer based on contextual information is denoted by ctx, using the context vector\nct = [a2,t, a1,t, a0,t]T . We perform transfer learning in standard GPs by stacking all observations\ntogether and augmenting the input space with the corresponding contextual information [37]. Note\nthat GP transfer learning uses a single marginal likelihood criterion over the data from all tasks,\nwhile ABLR NN transfer learning models the data from different tasks as conditionally independent.\nHPO converges to the minimum much faster for all transfer learning variants (leveraging data from 29\nrelated tasks) than for the single-task ones. The single-task ABLR based on the RKS representation\nwith D = 100 performed comparably to the one based on the NN representation with D = 50. The\ndimension D = 100 was picked after we investigated the computation time of ABLR-based HPO\nwith learned NN features (D = 50) and with RKS features (D \u2208 {50, 100, 200}) and found that\nthe running times were similar (see Figure 1c). Figure 1a also shows that multi-task ABLR did not\nbene\ufb01t much from the contextual information ct.\nWe further benchmarked single-task and transfer ABLR against the state-of-the-art NN-based ap-\nproaches DNGO [18] and BOHAMIANN [15]. In contrast to GP-based methods, all these approaches\nscale linearly in the total number of evaluations. 
For their implementation we used the publicly\navailable code https://github.com/automl/RoBO. We also used the recommended hyperparameters,\nwhich for BOHAMIANN were 2000 batches as burn-in followed by 2500 sampling steps. All\nNN architectures consisted of three fully connected layers, each with 50 units and tanh activation\nfunctions. Even though not considered in the original work [18], we extended DNGO to the transfer\nlearning case in the same way as the GP baseline above, stacking all observations and augmenting the\ninput space with contextual information. Unlike multi-task ABLR, a single marginal likelihood\ncriterion is used over data from all tasks.2 Results are shown in Figure 1b. The performance of\nsingle-task ABLR and BOHAMIANN is comparable (ABLR performs slightly better). DNGO and\nBOHAMIANN profit from transfer, yet less so than multi-task ABLR. Again, we note that the largest\nperformance gain is realized without context input ct. This suggests that multi-task ABLR learns a\nuseful joint representation through its shared feature map and better exploits similarities across tasks.\n\n1 The figures are reproduced in the supplement using a log scale to emphasize the gap between the curves.\n\n[Figure 1 plots: panels (a) and (b) show the current minimum against the BO iteration; panel (c) shows wall-clock time (sec) against the number of evaluations.]\n\nWhile the GP-based HPO with transfer slightly outperformed multi-task ABLR on the quadratic toy\nexample, it does not scale to larger data sets, such as those considered in the next section. To make\nthis more concrete, we measured the wall-clock time taken by HPO using GP and NN-based ABLR\nin a simple single-task setting. 
Our simulations showed that GP-based HPO will not scale much\nbeyond 2000 evaluations, which took approximately ten minutes, while ABLR-based HPO took only\na few seconds (we provide curves in the supplemental material). These results indicate that GP-based\nHPO is problematic when considering transfer learning at scale.\nAlthough all the considered NN-based algorithms scale linearly in N, with BOHAMIANN being\nslightly faster than DNGO (as observed in [15]), we found that our ABLR implementation requires\nmuch less computation time. More precisely, for the experiment in Figure 1b, the average time \u00b1 one\nstandard deviation per BO iteration over 300 repeated runs, on CPU, amounted to about 1.7 \u00b1 0.10\nseconds for single-task ABLR and 28 \u00b1 0.15 seconds for BOHAMIANN. In the following large-scale\nOpenML experiments, we report additional time comparisons with BOHAMIANN and DNGO.\n\n2 DNGO could also be used with a marginal likelihood criterion per task, but this would need substantial\nchanges to their code. Also, the set of $\\alpha_t$ and $\\beta_t$ would have to be tuned for each task, which is impractical.\n\nFigure 2: Transfer learning results obtained on OpenML and LIBSVM benchmarks (lower is better):\n(a) HPO warm-start in SVM; (b) HPO warm-start in XGBoost; (c) HPO with multiple signals.\nPanels (a) and (b) plot (1 - AUC) * 100 against the BO iteration; panel (c) plots test error * 100 against the BO iteration.\nSee respectively Section 4.3 and Section 4.4 for a discussion.\n\n4.3 Transfer learning across OpenML data sets\n\nThe ability to transfer knowledge across related tasks is particularly desirable in large-scale settings.\nWhenever runs from previous optimization tasks are available, these can be used to warm-start\nand potentially speed up the current optimization. We consider the OpenML platform [32], which\ncontains a large number of evaluations for a wide range of machine learning algorithms (referred\nto as flows in OpenML) over different data sets. We focus on some of the most popular binary\nclassification flows from OpenML, namely on a support vector machine (SVM, flow_id 5891) and\nextreme gradient boosting (XGBoost, flow_id 6767), and apply multi-task ABLR to optimize their\nhyperparameters. SVM comes with 4 and XGBoost with 10 hyperparameters. The parameters of the\nlatter two exhibit conditional relationships, which we deal with by imputation [38]. More details on\nthe OpenML set-up are given in the supplemental material. We filtered the T = 30 most evaluated\ndata sets for each flow_id, which amounts to $\\sum_t N_t \\approx 6.5 \\times 10^4$ ($N_t \\in [1087, 3250]$) for SVM\nand $\\sum_t N_t \\approx 5.9 \\times 10^5$ ($N_t \\in [10189, 48615]$) for XGBoost. For these problems, the linear scaling\nof ABLR becomes almost mandatory. GP-based models cannot exploit all data even for a single task.\nAs previously, we apply a leave-one-task-out protocol, where each task stands for a data set. For the\nleft-out task being optimized, say $t_0$, we use the surrogate modeling approach from [39] based on\na random forest model. We compare single-task variants GP, ABLR RKS, and ABLR NN which use\nevaluations of task $t_0$ only, with ABLR NN transfer and BOHAMIANN transfer (ctx), warm-started\nwith the evaluations from the other 29 tasks, and GP transfer (ctx, L1). In the last\napproach, we warm-start the GP model with 300 randomly drawn data points from the closest task in\nterms of $\\ell_1$ distance between contextual features (similar to [20]). 
In all OpenML experiments, we\nchose four contextual features: number of data points, number of features, class imbalance, and a\nlandmark feature based on Naive Bayes. We did not use those based on ensemble methods to avoid\npotential information leakage about the targets. Note that the multi-task ABLR variant is not provided\nwith context features.\nResults are reported in Figures 2a and 2b, respectively for SVM and XGBoost, averaged over 10\nrandom repetitions. The plots indicate that evaluation data from other data sets helps to speed up\nconvergence on a new task. In particular, ABLR NN transfer is able to leverage such data by way\nof learning an efficient shared set of features. Further experiments are found in the supplemental\nmaterial, also comparing the methods in terms of their mean ranking. We also provided the context\nvector ct as input to multi-task ABLR and optimized the hyperparameters of a random forest model,\nbut these settings did not lead to robust conclusions and further explorations are left for future work.\nAlso note that the performance of ABLR NN (D = 50) and ABLR RKS (D = 100) is comparable. The\nreal benefit of learning features by empirical Bayes is apparent in the multi-task scenario only.\nWe tried to compare to BOHAMIANN and DNGO in the same set-up, focusing on the SVM setting\nwhose scale is smaller than that of XGBoost. As a reference, running ABLR NN and ABLR NN\ntransfer in this setting took 1.2 \u00b1 0.2 and 16.3 \u00b1 1.6 seconds per BO iteration respectively. In\ncontrast, BOHAMIANN needed 1335.7 \u00b1 236.5 seconds per BO iteration, which was about 80\ntimes more expensive than ABLR NN transfer. Therefore, we ran ABLR for 200 BO iterations, and limited BOHAMIANN\nto only 50 iterations due to the high computational cost (see Figure 2a). Although BOHAMIANN\nwas able to greatly speed up the optimization when warm-started and provided with contextual\nfeatures, ABLR runs significantly faster and does not require dataset meta-data. 
As for DNGO, we\ndid not succeed in running the method for more than a single iteration, after which linear algebra errors\nrelated to the MCMC sampler caused the optimization to stop. This single iteration of DNGO took\n15597.6 \u00b1 5833.5 seconds, which is already 4 times as much as the total time of the 200 iterations of\nABLR NN transfer. All our measurements are made on a c4.2xlarge AWS machine.\n\n4.4 Tuning feedforward neural networks from heterogeneous signals\n\nIn a final experiment, we tune the parameters of feedforward NNs for binary classification. We\nuse multi-task ABLR to simultaneously model T signals associated with these feedforward NNs as\noutlined in Section 3.3. More specifically, we are interested in optimizing the validation error (i.e.,\nthe target signal) while modelling a range of auxiliary signals alongside (i.e., training error, training\ntime, training error after e epochs). Put differently, we use the multi-task nature of ABLR to model T\nsignals, learning a NN feature basis alongside a single HPO run. Importantly, the auxiliary signals\ncome essentially for free, while most previous HPO algorithms do not seem to make use of them. Also\nnote that, in contrast to the transfer learning settings above, we always evaluate all T signals together, at\nthe same input points x. The fact that ABLR scales linearly in T allows us to consider a large number\nof auxiliary signals (in contrast, multi-output GPs scale cubically in T ). In our experiments, we tune\nfour NN parameters: number of layers in {1, . . . , 4}, number of units in {1, . . . , 50}, \u21132 regularization\nconstant in {2^\u22126, 2^\u22125, . . . , 2^3}, and learning rate of Adam [40] in {2^\u22126, 2^\u22125, . . . , 2^\u22121}.\nResults are provided in Figure 2c, averaged over 10 random repetitions and 5 data sets ({w1a,\nw8a} [41], sonar [42], phishing [43, 42], australian [44, 42]) from LIBSVM [45]. 
The feedforward NN was trained for 200 iterations, each time on a batch of 200 samples. All variants consider\nthe validation error as the signal of interest (and target for HPO). ABLR train loss (2) also\nuses the final value of the training loss, ABLR cost (2) the CPU training time, and ABLR cost\n+ train loss (3) both. Finally, ABLR cost + epochwise train loss (21) uses the cost\nand the training error collected every 10 training iterations. In the model names, the number in\nparentheses denotes the number T of signals modeled in ABLR. We can see that adding auxiliary\nsignals to an HPO run driven by ABLR NN speeds up convergence. Note that this improvement comes\nfrom adding information which is available for free. We conjecture that adding auxiliary signals\nrelated to the criterion of interest (e.g., gradient norms) would facilitate learning a useful feature\nbasis by way of a feedforward NN, even if only one of these signals is the target of HPO. The ability\nto learn the parameters {\u03b1t, \u03b2t} per signal automatically is vital as some of the signals modeled with\nABLR NN have different scales (e.g., validation error versus training time).\n\n5 Conclusion\n\nWe introduced multi-task adaptive Bayesian linear regression (ABLR), a novel method for Bayesian\noptimization which scales linearly in the number of observations and is specifically designed for\ntransfer learning in this context. Each task is modeled by a Bayesian linear regression layer on top\nof a shared feature map, learned jointly for all tasks with a deep neural net. Each Bayesian linear\nregression model comes with its own scale and noise parameters, which are learned together with the\nneural net parameters by empirical Bayes. When leveraging the auto-grad operators for the Cholesky\ndecomposition [11], we found that training is at least as fast as the two-step heuristic recommended\nin [18].\nWe applied multi-task ABLR to two transfer learning problems in HPO. 
First, we investigated\nwarm-starting HPO with synthetic optimization and meta-learning problems from OpenML. We\ndemonstrated that multi-task ABLR converges considerably faster than GPs or other NN-based\napproaches, and scales to much larger sets of evaluations. We attribute the success of our method\nto its ability to learn a useful representation across tasks, even in the absence of meta-data. We\nspeculate that this is due to the specific loss structure, which factorizes over the tasks. Multi-task\nABLR further allows meta-data to be fed as context vectors to the underlying neural net, allowing\nthe learned features to be task-specific without the need to design task distance metrics or to perform\nmanual tuning. Second, we investigated multi-signal HPO for feedforward neural nets, showing that\nmulti-task ABLR can leverage side-signals to speed up the optimization.\nSeveral extensions are of interest. The Bayesian linear regression layers could be complemented by\nlogistic regression layers in order to optimize binary signals or drive constrained HPO [46]. In a\nmeta-learning context, we would have to further scale multi-task ABLR to a large number of tasks, a\nregime where batch learning by L-BFGS has to be replaced by stochastic optimization at the level of\ntasks. Finally, our joint Bayesian learning for deep NNs with a final Bayesian layer (which requires\nback-propagation through linear algebra operators such as Cholesky) can be applied to multi-task\nactive learning or multi-label learning. Unlike most other approximate Bayesian treatments of\ndeep NNs [47, 48], we do not need random sampling or loose variational bounds, but can fully\nleverage exact inference or tight approximations developed for generalized linear models.\n\nReferences\n[1] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the human out\nof the loop: A review of Bayesian optimization. 
Proceedings of the IEEE, 104(1):148\u2013175, 2016.\n\n[2] Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking\nthe extremum. Towards Global Optimization, 2:117\u2013129, 1978.\n\n[3] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive\nblack-box functions. Journal of Global Optimization, 13(4):455\u2013492, 1998.\n\n[4] Carl Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[5] Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian\nprocess regression. Journal of Machine Learning Research, 6:1939\u20131959, 2005.\n\n[6] Michalis K Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International\nConference on Artificial Intelligence and Statistics, pages 567\u2013574, 2009.\n\n[7] Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian\nprocesses. Technical report, preprint arXiv:1511.01870, 2015.\n\n[8] R\u00e9mi Bardenet, M\u00e1ty\u00e1s Brendel, Bal\u00e1zs K\u00e9gl, and Michele Sebag. Collaborative hyperparameter tuning.\nIn Proceedings of the International Conference on Machine Learning (ICML), pages 199\u2013207, 2013.\n\n[9] Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning.\nIn Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages\n1077\u20131085, 2014.\n\n[10] C. M. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.\n\n[11] Matthias Seeger, Asmus Hetzel, Zhenwen Dai, and Neil D Lawrence. Auto-differentiating linear algebra.\nTechnical report, preprint arXiv:1710.08717, 2017.\n\n[12] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan\nZhang, and Zheng Zhang. 
MXNet: A flexible and efficient machine learning library for heterogeneous\ndistributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems,\n2015.\n\n[13] G. F. Trecate, C. K. I. Williams, and M. Opper. Finite-dimensional approximations of Gaussian processes.\nIn Neural Information Processing Systems 11, 1999.\n\n[14] C. Rasmussen and J. Quinonero Candela. Healing the relevance vector machine through augmentation. In\nInternational Conference on Machine Learning 22, 2005.\n\n[15] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with\nrobust Bayesian neural networks. In Advances in Neural Information Processing Systems (NIPS), pages\n4134\u20134142, 2016.\n\n[16] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In\nArtificial Intelligence and Statistics, 2016. arXiv:1511.02222.\n\n[17] Mitchell McIntire, Daniel Ratner, and Stefano Ermon. Sparse Gaussian processes for Bayesian optimization.\nIn Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2016.\n\n[18] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa\nPatwary, Mr Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In\nProceedings of the International Conference on Machine Learning (ICML), pages 2171\u20132180, 2015.\n\n[19] R. M. Neal. Bayesian Learning for Neural Networks. Number 118 in Lecture Notes in Statistics. Springer,\n1996.\n\n[20] Matthias Feurer, T Springenberg, and Frank Hutter. Initializing Bayesian hyperparameter optimization via\nmeta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.\n\n[21] Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task Bayesian optimization. 
In Advances in\nNeural Information Processing Systems (NIPS), pages 2004\u20132012, 2013.\n\n[22] Matthias Poloczek, Jialei Wang, and Peter I. Frazier. Warm starting Bayesian optimization. In Winter\nSimulation Conference, WSC 2016, Washington, DC, USA, December 11-14, 2016, pages 770\u2013781, 2016.\n\n[23] Matthias Poloczek, Jialei Wang, and Peter Frazier. Multi-information source optimization. In Advances in\nNeural Information Processing Systems 30, pages 4291\u20134301, 2017.\n\n[24] Nicolas Schilling, Martin Wistuba, Lucas Drumond, and Lars Schmidt-Thieme. Hyperparameter optimization\nwith factorized multilayer perceptrons. Proceedings of the European Conference on Machine\nLearning (ECML), pages 87\u2013103, 2015.\n\n[25] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google\nVizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International\nConference on Knowledge Discovery and Data Mining, pages 1487\u20131495, 2017.\n\n[26] Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. Learning hyperparameter optimization\ninitializations. In DSAA, pages 1\u201310. IEEE, 2015.\n\n[27] Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. Two-Stage Transfer Surrogate Model for\nAutomatic Hyperparameter Optimization. Proceedings of the European Conference on Machine Learning\n(ECML), pages 199\u2013214, 2016.\n\n[28] Matthias Feurer, Benjamin Letham, and Eytan Bakshy. Scalable Meta-Learning for Bayesian Optimization.\nTechnical report, Preprint arXiv:1802.02219, 2018.\n\n[29] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine\nLearning Research, 4:83\u201389, 2003.\n\n[30] Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound\nconstrained optimization. SIAM Journal on Scientific Computing, 16(5):1190\u20131208, 1995.\n\n[31] David J. C. MacKay. 
Information Theory, Inference and Learning Algorithms. Cambridge University Press,\n2003.\n\n[32] Joaquin Vanschoren, Jan N Van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in\nmachine learning. ACM SIGKDD Explorations Newsletter, 15(2):49\u201360, 2014.\n\n[33] GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt,\n2016.\n\n[34] Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In Advances in\nNeural Information Processing Systems (NIPS) 20, pages 1177\u20131184, 2007.\n\n[35] Jos\u00e9 Miguel Hern\u00e1ndez-Lobato, James Requeima, Edward O Pyzer-Knapp, and Al\u00e1n Aspuru-Guzik.\nParallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In\nProceedings of the International Conference on Machine Learning (ICML), 2017.\n\n[36] R. Jenatton, C. Archambeau, J. Gonzales, and M. Seeger. Bayesian optimization with tree-structured\ndependencies. In Proceedings of the International Conference on Machine Learning (ICML), 2017.\n\n[37] Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In Advances in\nNeural Information Processing Systems (NIPS), pages 2447\u20132455, 2011.\n\n[38] Julien-Charles L\u00e9vesque, Audrey Durand, Christian Gagn\u00e9, and Robert Sabourin. Bayesian optimization\nfor conditional hyperparameter spaces. In International Joint Conference on Neural Networks (IJCNN),\n2017.\n\n[39] K Eggensperger, F Hutter, HH Hoos, and K Leyton-Brown. Efficient benchmarking of hyperparameter\noptimizers via surrogates. In Proceedings of the 29th AAAI\nConference on Artificial Intelligence, pages 1114\u20131120, 2015.\n\n[40] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
Technical report, preprint\narXiv:1412.6980, 2014.\n\n[41] John C Platt. Fast training of support vector machines using sequential minimal optimization, pages\n185\u2013208. MIT Press, 1999.\n\n[42] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.\n\n[43] Rami M Mohammad, Fadi Thabtah, and Lee McCluskey. An assessment of features related to phishing\nwebsites using an automated technique. In Internet Technology And Secured Transactions, 2012\nInternational Conference for, pages 492\u2013497. IEEE, 2012.\n\n[44] J. R. Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, 1987.\n\n[45] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions\non Intelligent Systems and Technology, 2:27:1\u201327:27, 2011.\n\n[46] Michael A Gelbart, Jasper Snoek, and Ryan P Adams. Bayesian optimization with unknown constraints.\nTechnical report, preprint arXiv:1403.5607, 2014.\n\n[47] D. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning\nRepresentations, 2014.\n\n[48] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep\nlatent Gaussian models. In International Conference on Machine Learning 31, 2014.\n", "award": [], "sourceid": 3434, "authors": [{"given_name": "Valerio", "family_name": "Perrone", "institution": "University of Warwick"}, {"given_name": "Rodolphe", "family_name": "Jenatton", "institution": "Amazon Research"}, {"given_name": "Matthias", "family_name": "Seeger", "institution": "Amazon Development Center"}, {"given_name": "Cedric", "family_name": "Archambeau", "institution": "Amazon"}]}