{"title": "Algorithms for Hyper-Parameter Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2546, "page_last": 2554, "abstract": "Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [Larochelle et al., 2007] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements.", "full_text": "Algorithms for Hyper-Parameter Optimization\n\nJames Bergstra\n\nThe Rowland Institute\n\nHarvard University\n\nbergstra@rowland.harvard.edu\n\nR\u00b4emi Bardenet\n\nLaboratoire de Recherche en Informatique\n\nUniversit\u00b4e Paris-Sud\nbardenet@lri.fr\n\nYoshua Bengio\n\nD\u00b4ept. 
d\u2019Informatique et Recherche Op\u00b4erationelle\n\nUniversit\u00b4e de Montr\u00b4eal\n\nyoshua.bengio@umontreal.ca\n\nBal\u00b4azs K\u00b4egl\n\nLinear Accelerator Laboratory\nUniversit\u00b4e Paris-Sud, CNRS\n\nbalazs.kegl@gmail.com\n\nAbstract\n\nSeveral recent advances to the state of the art in image classi\ufb01cation benchmarks\nhave come from better con\ufb01gurations of existing techniques rather than novel ap-\nproaches to feature learning. Traditionally, hyper-parameter optimization has been\nthe job of humans because they can be very ef\ufb01cient in regimes where only a few\ntrials are possible. Presently, computer clusters and GPU processors make it pos-\nsible to run more trials and we show that algorithmic approaches can \ufb01nd better\nresults. We present hyper-parameter optimization results on tasks of training neu-\nral networks and deep belief networks (DBNs). We optimize hyper-parameters\nusing random search and two new greedy sequential methods based on the ex-\npected improvement criterion. Random search has been shown to be suf\ufb01ciently\nef\ufb01cient for learning neural networks for several datasets, but we show it is unreli-\nable for training DBNs. The sequential algorithms are applied to the most dif\ufb01cult\nDBN learning problems from [1] and \ufb01nd signi\ufb01cantly better results than the best\npreviously reported. 
This work contributes novel techniques for making response\nsurface models P (y|x) in which many elements of hyper-parameter assignment\n(x) are known to be irrelevant given particular values of other elements.\n\n1\n\nIntroduction\n\nModels such as Deep Belief Networks (DBNs) [2], stacked denoising autoencoders [3], convo-\nlutional networks [4], as well as classi\ufb01ers based on sophisticated feature extraction techniques\nhave from ten to perhaps \ufb01fty hyper-parameters, depending on how the experimenter chooses to\nparametrize the model, and how many hyper-parameters the experimenter chooses to \ufb01x at a rea-\nsonable default. The dif\ufb01culty of tuning these models makes published results dif\ufb01cult to reproduce\nand extend, and makes even the original investigation of such methods more of an art than a science.\nRecent results such as [5], [6], and [7] demonstrate that the challenge of hyper-parameter opti-\nmization in large and multilayer models is a direct impediment to scienti\ufb01c progress. These works\nhave advanced state of the art performance on image classi\ufb01cation problems by more concerted\nhyper-parameter optimization in simple algorithms, rather than by innovative modeling or machine\nlearning strategies. It would be wrong to conclude from a result such as [5] that feature learning\nis useless. Instead, hyper-parameter optimization should be regarded as a formal outer loop in the\nlearning process. A learning algorithm, as a functional from data to classi\ufb01er (taking classi\ufb01cation\nproblems as an example), includes a budgeting choice of how many CPU cycles are to be spent\non hyper-parameter exploration, and how many CPU cycles are to be spent evaluating each hyper-\nparameter choice (i.e. by tuning the regular parameters). 
The results of [5] and [7] suggest that\nwith current generation hardware such as large computer clusters and GPUs, the optimal alloca-\n\n1\n\n\ftion of CPU cycles includes more hyper-parameter exploration than has been typical in the machine\nlearning literature.\nHyper-parameter optimization is the problem of optimizing a loss function over a graph-structured\ncon\ufb01guration space. In this work we restrict ourselves to tree-structured con\ufb01guration spaces. Con-\n\ufb01guration spaces are tree-structured in the sense that some leaf variables (e.g. the number of hidden\nunits in the 2nd layer of a DBN) are only well-de\ufb01ned when node variables (e.g. a discrete choice of\nhow many layers to use) take particular values. Not only must a hyper-parameter optimization algo-\nrithm optimize over variables which are discrete, ordinal, and continuous, but it must simultaneously\nchoose which variables to optimize.\nIn this work we de\ufb01ne a con\ufb01guration space by a generative process for drawing valid samples.\nRandom search is the algorithm of drawing hyper-parameter assignments from that process and\nevaluating them. Optimization algorithms work by identifying hyper-parameter assignments that\ncould have been drawn, and that appear promising on the basis of the loss function\u2019s value at other\npoints. This paper makes two contributions: 1) Random search is competitive with the manual\noptimization of DBNs in [1], and 2) Automatic sequential optimization outperforms both manual\nand random search.\nSection 2 covers sequential model-based optimization, and the expected improvement criterion. Sec-\ntion 3 introduces a Gaussian Process based hyper-parameter optimization algorithm. Section 4 in-\ntroduces a second approach based on adaptive Parzen windows. Section 5 describes the problem of\nDBN hyper-parameter optimization, and shows the ef\ufb01ciency of random search. 
Section 6 shows the efficiency of sequential optimization on the two hardest datasets according to random search. The paper concludes with a discussion of results in Section 7 and concluding remarks in Section 8.

2 Sequential Model-based Global Optimization

Sequential Model-Based Global Optimization (SMBO) algorithms have been used in many applications where evaluation of the fitness function is expensive [8, 9]. In an application where the true fitness function f : X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or some transformation of the surrogate. The point x* that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should be evaluated. This active-learning-like algorithm template is summarized in Figure 1. SMBO algorithms differ in the criterion they optimize to obtain x* given a model (or surrogate) of f, and in how they model f via the observation history H.

SMBO(f, M0, T, S)
1   H ← ∅
2   For t ← 1 to T:
3       x* ← argmin_x S(x, M_{t−1})
4       Evaluate f(x*)                ▷ Expensive step
5       H ← H ∪ {(x*, f(x*))}
6       Fit a new model M_t to H
7   return H

Figure 1: The pseudo-code of generic Sequential Model-Based Optimization.

The algorithms in this work optimize the criterion of Expected Improvement (EI) [10]. Other criteria have been suggested, such as Probability of Improvement and Expected Improvement [10], minimizing the Conditional Entropy of the Minimizer [11], and the bandit-based criterion described in [12]. We chose the EI criterion for our work because it is intuitive and has been shown to work well in a variety of settings. We leave the systematic exploration of improvement criteria for future work.
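The template of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `fit_model`, `criterion`, and the toy functions below are hypothetical placeholders standing in for the surrogate-fitting routine M and the criterion S.

```python
def smbo(f, fit_model, criterion, candidates, T):
    """Generic SMBO loop of Figure 1: propose, evaluate, refit."""
    history = []   # H: list of (x, f(x)) pairs
    model = None   # M0: no model before any observations
    for _ in range(T):
        # Step 3: pick the candidate minimizing the criterion S(x, M_{t-1}).
        x_star = min(candidates, key=lambda x: criterion(x, model, history))
        y = f(x_star)                 # Step 4: the expensive evaluation
        history.append((x_star, y))   # Step 5: grow H
        model = fit_model(history)    # Step 6: refit the surrogate
    return history

# Toy stand-ins: the "model" is just the incumbent best observation,
# and the criterion prefers untried points near the incumbent.
def toy_criterion(x, model, history):
    tried = {xx for xx, _ in history}
    if x in tried:
        return float("inf")   # never re-propose an evaluated point
    if model is None:
        return x              # deterministic first pick
    best_x, _ = model
    return abs(x - best_x)    # exploit: stay near the incumbent

def toy_fit(history):
    return min(history, key=lambda xy: xy[1])

hist = smbo(lambda x: (x - 3) ** 2, toy_fit, toy_criterion,
            candidates=list(range(10)), T=5)
# The loop walks 0 -> 1 -> 2 -> 3 and finds the minimum f(3) = 0.
```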
Expected improvement is the expectation under some model M of f : X → R that f(x) will exceed (negatively) some threshold y*:

    EI_{y*}(x) := ∫_{−∞}^{∞} max(y* − y, 0) p_M(y|x) dy.    (1)

The contribution of this work is two novel strategies for approximating f by modeling H: a hierarchical Gaussian Process and a tree-structured Parzen estimator. These are described in Section 3 and Section 4 respectively.

3 The Gaussian Process Approach (GP)

Gaussian Processes have long been recognized as a good method for modeling loss functions in the model-based optimization literature [13]. Gaussian Processes (GPs, [14]) are priors over functions that are closed under sampling, which means that if the prior distribution of f is believed to be a GP with mean 0 and kernel k, the conditional distribution of f given a sample H = (x_i, f(x_i))_{i=1}^{n} of its values is also a GP, whose mean and covariance function are analytically derivable. GPs with generic mean functions can in principle be used, but it is simpler and sufficient for our purposes to consider only zero-mean processes; we do this by centering the function values in the considered data sets. Modelling e.g. linear trends in the GP mean leads to undesirable extrapolation in unexplored regions during SMBO [15].

The above-mentioned closedness property, along with the fact that GPs provide an assessment of prediction uncertainty incorporating the effect of data scarcity, makes the GP an elegant candidate for both finding a candidate x* (Figure 1, step 3) and fitting a model M_t (Figure 1, step 6).
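When p_M(y|x) is the Gaussian predictive distribution of a GP with mean μ(x) and standard deviation σ(x), the integral in (1) has the standard closed form EI = σ·(zΦ(z) + φ(z)) with z = (y* − μ)/σ, where Φ and φ are the standard normal CDF and PDF. A self-contained sketch (not the paper's code):

```python
import math

def gaussian_ei(mu, sigma, y_star):
    """Closed-form EI of Eq. (1) when p_M(y|x) = N(mu, sigma^2).

    EI_{y*}(x) = E[max(y* - y, 0)] = sigma * (z * Phi(z) + phi(z)),
    with z = (y* - mu) / sigma.
    """
    if sigma <= 0.0:
        return max(y_star - mu, 0.0)   # deterministic prediction
    z = (y_star - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # normal PDF
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # normal CDF
    return sigma * (z * Phi + phi)

# Points with a low predicted mean OR high uncertainty score well:
ei_good_mean = gaussian_ei(mu=0.10, sigma=0.01, y_star=0.12)
ei_uncertain = gaussian_ei(mu=0.15, sigma=0.10, y_star=0.12)
ei_hopeless  = gaussian_ei(mu=0.50, sigma=0.01, y_star=0.12)
```

This makes the exploration/exploitation compromise of Section 3.1 concrete: a confidently good mean and an uncertain mediocre mean can both yield high EI, while a confidently bad mean yields essentially none.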
The runtime of each iteration of the GP approach scales cubically in |H| and linearly in the number of variables being optimized; however, the expense of the function evaluations f(x*) typically dominates even this cubic cost.

3.1 Optimizing EI in the GP
We model f with a GP and set y* to the best value found after observing H: y* = min{f(x_i), 1 ≤ i ≤ n}. The model p_M in (1) is then the posterior GP given H. The EI function in (1) encapsulates a compromise between regions where the mean function is close to or better than y* and under-explored regions where the uncertainty is high.

EI functions are usually optimized with an exhaustive grid search over the input space, or a Latin Hypercube search in higher dimensions. However, some information on the landscape of the EI criterion can be derived from simple computations [16]: 1) it is always non-negative and zero at training points from D, 2) it inherits the smoothness of the kernel k, which is in practice often at least once differentiable, and notably, 3) the EI criterion is likely to be highly multi-modal, especially as the number of training points increases. The authors of [16] used the preceding remarks on the landscape of EI to design an evolutionary algorithm with mixture search, specifically aimed at optimizing EI, that is shown to outperform exhaustive search for a given budget of EI evaluations. We borrow their approach here and go one step further. We keep the Estimation of Distribution (EDA, [17]) approach on the discrete part of our input space (categorical and discrete hyper-parameters), where we sample candidate points according to binomial distributions, while we use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES, [18]) for the remaining part of our input space (continuous hyper-parameters).
CMA-ES is a state-of-the-art gradient-free evolutionary algorithm for optimization on continuous domains, which has been shown to outperform the Gaussian search EDA. Note that such a gradient-free approach permits non-differentiable kernels for the GP regression. We do not adopt the mixture search of [16], but rather restart the local searches several times, starting from promising places. The use of tessellations suggested by [16] is prohibitive here, as our task often means working in more than 10 dimensions; we therefore start each local search at the center of mass of a simplex with vertices randomly picked among the training points.

Finally, we remark that not all hyper-parameters are relevant for each point. For example, a DBN with only one hidden layer does not have parameters associated with a second or third layer. Thus it is not enough to place one GP over the entire space of hyper-parameters. We chose to group the hyper-parameters by common use in a tree-like fashion and place different independent GPs over each group. For DBNs, this means placing one GP over common hyper-parameters, including the categorical parameters that indicate which conditional groups to consider, three GPs on the parameters corresponding to each of the three layers, and a few 1-dimensional GPs over individual conditional hyper-parameters, like ZCA energy (see Table 1 for DBN parameters).

4 Tree-structured Parzen Estimator Approach (TPE)
Anticipating that our hyper-parameter optimization tasks will mean high dimensions and small fitness evaluation budgets, we now turn to another modeling strategy and EI optimization scheme for the SMBO algorithm. Whereas the Gaussian-process based approach modeled p(y|x) directly, this strategy models p(x|y) and p(y).
Recall from the introduction that the configuration space X is described by a graph-structured generative process (e.g.
first choose a number of DBN layers, then choose the parameters for each). The tree-structured Parzen estimator (TPE) models p(x|y) by transforming that generative process, replacing the distributions of the configuration prior with non-parametric densities. In the experimental section, we will see that the configuration space is described using uniform, log-uniform, quantized log-uniform, and categorical variables. In these cases, the TPE algorithm makes the following replacements: uniform → truncated Gaussian mixture, log-uniform → exponentiated truncated Gaussian mixture, categorical → re-weighted categorical. Using different observations {x(1), ..., x(k)} in the non-parametric densities, these substitutions represent a learning algorithm that can produce a variety of densities over the configuration space X. The TPE defines p(x|y) using two such densities:

    p(x|y) = ℓ(x)   if y < y*
             g(x)   if y ≥ y*,    (2)

where ℓ(x) is the density formed by using the observations {x(i)} such that the corresponding loss f(x(i)) was less than y*, and g(x) is the density formed by using the remaining observations. Whereas the GP-based approach favoured quite an aggressive y* (typically less than the best observed loss), the TPE algorithm depends on a y* that is larger than the best observed f(x), so that some points can be used to form ℓ(x). The TPE algorithm chooses y* to be some quantile γ of the observed y values, so that p(y < y*) = γ, but no specific model for p(y) is necessary. By maintaining sorted lists of observed variables in H, the runtime of each iteration of the TPE algorithm can scale linearly in |H| and linearly in the number of variables (dimensions) being optimized.

4.1 Optimizing EI in the TPE algorithm
The parametrization of p(x, y) as p(y)p(x|y) in the TPE algorithm was chosen to facilitate the optimization of EI:

    EI_{y*}(x) = ∫_{−∞}^{y*} (y* − y) p(y|x) dy = ∫_{−∞}^{y*} (y* − y) [p(x|y) p(y) / p(x)] dy.    (3)

By construction, γ = p(y < y*) and p(x) = ∫_R p(x|y) p(y) dy = γ ℓ(x) + (1 − γ) g(x). Therefore

    ∫_{−∞}^{y*} (y* − y) p(x|y) p(y) dy = ℓ(x) ∫_{−∞}^{y*} (y* − y) p(y) dy = γ y* ℓ(x) − ℓ(x) ∫_{−∞}^{y*} y p(y) dy,

so that finally

    EI_{y*}(x) = [γ y* ℓ(x) − ℓ(x) ∫_{−∞}^{y*} y p(y) dy] / [γ ℓ(x) + (1 − γ) g(x)] ∝ (γ + (g(x)/ℓ(x)) (1 − γ))^{−1}.

This last expression shows that to maximize improvement we would like points x with high probability under ℓ(x) and low probability under g(x). The tree-structured form of ℓ and g makes it easy to draw many candidates according to ℓ and evaluate them according to g(x)/ℓ(x). On each iteration, the algorithm returns the candidate x* with the greatest EI.

4.2 Details of the Parzen Estimator
The models ℓ(x) and g(x) are hierarchical processes involving discrete-valued and continuous-valued variables. The Adaptive Parzen Estimator yields a model over X by placing density in the vicinity of K observations B = {x(1), ..., x(K)} ⊂ H.
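Because EI in (3) is monotone increasing in ℓ(x)/g(x), the candidate-selection rule of Section 4.1 reduces to ranking draws by that ratio. A minimal sketch, where `l_pdf` and `g_pdf` are hypothetical density functions standing in for the fitted ℓ and g:

```python
import math

def tpe_propose(candidates, l_pdf, g_pdf, gamma=0.15):
    """Pick the candidate maximizing EI under the TPE factorization.

    From Eq. (3), EI_{y*}(x) is proportional to
    (gamma + (g(x)/l(x)) * (1 - gamma)) ** -1, which is monotone
    increasing in l(x)/g(x): prefer points likely under l, unlikely under g.
    """
    def score(x):
        g = max(g_pdf(x), 1e-12)   # guard against zero density
        return l_pdf(x) / g
    return max(candidates, key=score)

def normal_pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Toy densities: l concentrates near 0.2 (the good region), g near 0.8.
best = tpe_propose(
    candidates=[0.2, 0.5, 0.8],
    l_pdf=lambda x: normal_pdf(x, 0.2, 0.1),
    g_pdf=lambda x: normal_pdf(x, 0.8, 0.1),
)
# best is 0.2: highest l(x)/g(x) among the candidates.
```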
Each continuous hyper-parameter was specified by a uniform prior over some interval (a, b), or a Gaussian, or a log-uniform distribution. The TPE substitutes an equally-weighted mixture of that prior with Gaussians centered at each of the x(i) ∈ B. The standard deviation of each Gaussian was set to the greater of the distances to the left and right neighbor, but clipped to remain in a reasonable range. In the case of the uniform, the points a and b were considered to be potential neighbors. For discrete variables, supposing the prior was a vector of N probabilities p_i, the posterior vector elements were proportional to N p_i + C_i, where C_i counts the occurrences of choice i in B. Log-uniform hyper-parameters were treated as uniforms in the log domain.

Table 1: Distribution over DBN hyper-parameters for random sampling. Options separated by "or" such as pre-processing (and including the random seed) are weighted equally. Symbol U means uniform, N means Gaussian-distributed, and log U means uniformly distributed in the log-domain. CD (also known as CD-1) stands for contrastive divergence, the algorithm used to initialize the layer parameters of the DBN.

Whole model
    Parameter                Prior
    pre-processing           raw or ZCA
    ZCA energy               U(.5, 1)
    random seed              5 choices
    classifier learn rate    log U(0.001, 10)
    classifier anneal start  log U(100, 10^4)
    classifier ℓ2-penalty    0 or log U(10^−7, 10^−4)
    n. layers                1 to 3
    batch size               20 or 100

Per-layer
    Parameter                Prior
    n. hidden units          log U(128, 4096)
    W init                   U(−a, a) or N(0, a^2)
    a                        algo A or B (see text)
    algo A coef              U(.2, 2)
    CD epochs                log U(1, 10^4)
    CD learn rate            log U(10^−4, 1)
    CD anneal start          log U(10, 10^4)
    CD sample data           yes or no

5 Random Search for Hyper-Parameter Optimization in DBNs

One simple but recent step toward formalizing hyper-parameter optimization is the use of random search [5].
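Random search here just means repeatedly drawing configurations from the tree-structured generative process. A hedged sketch of one draw from a deliberately simplified, hypothetical subset of the Table 1 prior (the field names and structure below are illustrative, not the paper's exact space):

```python
import math
import random

def log_uniform(rng, lo, hi):
    """Draw log-uniformly between lo and hi (both > 0)."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def sample_dbn_config(rng):
    """Draw one configuration from a simplified Table-1-style prior.

    Per-layer hyper-parameters exist only for the layers the draw
    actually uses, and ZCA energy exists only when ZCA is chosen --
    this is the tree structure described in the introduction.
    """
    cfg = {
        "preprocessing": rng.choice(["raw", "zca"]),
        "learn_rate": log_uniform(rng, 1e-3, 10.0),
        "n_layers": rng.randint(1, 3),
        "batch_size": rng.choice([20, 100]),
    }
    if cfg["preprocessing"] == "zca":
        cfg["zca_energy"] = rng.uniform(0.5, 1.0)      # conditional on ZCA
    cfg["layers"] = [
        {"n_hidden": int(log_uniform(rng, 128, 4096)),
         "cd_learn_rate": log_uniform(rng, 1e-4, 1.0)}
        for _ in range(cfg["n_layers"])                # conditional on n_layers
    ]
    return cfg

rng = random.Random(0)
cfg = sample_dbn_config(rng)
```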
[19] showed that random search was much more efficient than grid search for optimizing the parameters of one-layer neural network classifiers. In this section, we evaluate random search for DBN optimization, compared with the sequential grid-assisted manual search carried out in [1]. We chose the prior listed in Table 1 to define the search space over DBN configurations. The details of the datasets, the DBN model, and the greedy layer-wise training procedure based on CD are provided in [1]. This prior corresponds to the search space of [1] except for the following differences: (a) we allowed for ZCA pre-processing [20], (b) we allowed each layer to have a different size, (c) we allowed each layer to have its own training parameters for CD, (d) we allowed for the possibility of treating the continuous-valued data either as Bernoulli means (more theoretically correct) or as Bernoulli samples (more typical) in the CD algorithm, and (e) we did not discretize the possible values of real-valued hyper-parameters. These changes expand the hyper-parameter search problem, while maintaining the original hyper-parameter search space as a subset of the expanded search space.
The results of this preliminary random search are in Figure 2. Perhaps surprisingly, the result of manual search can be reliably matched with 32 random trials for several datasets. The efficiency of random search in this setting is explored further in [21]. Where random search results match human performance, it is not clear from Figure 2 whether the reason is that it searched the original space as efficiently, or that it searched a larger space where good performance is easier to find.
But\nthe objection that random search is somehow cheating by searching a larger space is backward \u2013\nthe search space outlined in Table 1 is a natural description of the hyper-parameter optimization\nproblem, and the restrictions to that space by [1] were presumably made to simplify the search\nproblem and make it tractable for grid-search assisted manual search. Critically, both methods train\nDBNs on the same datasets.\nThe results in Figure 2 indicate that hyper-parameter optimization is harder for some datasets. For\nexample, in the case of the \u201cMNIST rotated background images\u201d dataset (MRBI), random sampling\nappears to converge to a maximum relatively quickly (best models among experiments of 32 trials\nshow little variance in performance), but this plateau is lower than what was found by manual search.\nIn another dataset (convex), the random sampling procedure exceeds the performance of manual\nsearch, but is slow to converge to any sort of plateau. There is considerable variance in generalization\nwhen the best of 32 models is selected. This slow convergence indicates that better performance is\nprobably available, but we need to search the con\ufb01guration space more ef\ufb01ciently to \ufb01nd it. The\nremainder of this paper explores sequential optimization strategies for hyper-parameter optimization\nfor these two datasets: convex and MRBI.\n\n6 Sequential Search for Hyper-Parameter Optimization in DBNs\n\nWe validated our GP approach of Section 3.1 by comparing with random sampling on the Boston\nHousing dataset, a regression task with 506 points made of 13 scaled input variables and a scalar\n\n5\n\n\fFigure 2: Deep Belief Network (DBN) performance according to random search. Random\nsearch is used to explore up to 32 hyper-parameters (see Table 1). Results found using a\ngrid-search-assisted manual search over a similar domain with an average 41 trials are\ngiven in green (1-layer DBN) and red (3-layer DBN). 
Each box-plot (for N = 1, 2, 4, ...) shows the distribution of test set performance when the best model among N random trials is selected. The datasets "convex" and "mnist rotated background images" are used for more thorough hyper-parameter optimization.

regressed output. We trained a Multi-Layer Perceptron (MLP) with 10 hyper-parameters: learning rate, ℓ1 and ℓ2 penalties, size of the hidden layer, number of iterations, and whether PCA pre-processing was applied; the PCA energy was the only conditional hyper-parameter [22]. Our results are depicted in Figure 3. The first 30 iterations were made using random sampling, while from the 30th on, we differentiated the random samples from the GP approach trained on the updated history. The experiment was repeated 20 times. Although the number of points is particularly small compared to the dimensionality, the surrogate modelling approach finds noticeably better points than random, which supports the application of SMBO approaches to more ambitious tasks and datasets.
Applying the GP to the problem of optimizing DBN performance, we allowed 3 random restarts of the CMA-ES algorithm per proposal x*, and up to 500 iterations of the conjugate gradient method in fitting the length scales of the GP. The squared exponential kernel [14] was used for every node. The CMA-ES part of the GP approach dealt with boundaries using a penalty method, while the binomial sampling part respected boundaries by construction. The GP algorithm was initialized with 30 randomly sampled points in H. After 200 trials, the prediction of a point x* using this GP took around 150 seconds.
For the TPE-based algorithm, we chose γ = 0.15 and picked the best among 100 candidates drawn from ℓ(x) on each iteration as the proposal x*. After 200 trials, the prediction of a point x* using this TPE algorithm took around 10 seconds.
TPE was allowed to grow past the initial bounds used for random sampling in the course of optimization, whereas the GP and random search were restricted to stay within the initial bounds throughout. The TPE algorithm was also initialized with the same 30 randomly sampled points as were used to seed the GP.

6.1 Parallelizing Sequential Search
Both the GP and TPE approaches were run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete. For the GP approach, the so-called constant liar approach was used: each time a candidate point x* was proposed, a fake fitness evaluation equal to the mean of the y's within the training set D was assigned temporarily, until the evaluation completed and reported the actual loss f(x*). For the TPE approach, we simply ignored recently proposed points and relied on the stochasticity of draws from ℓ(x) to provide different candidates from one iteration to the next. The consequence of parallelization is that each proposal x* is based on less feedback.
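The constant-liar trick can be sketched synchronously for clarity. This is a structural illustration with hypothetical helper names, not the paper's asynchronous infrastructure; the toy criterion here ignores the surrogate, so only the deduplication effect of the lies is exercised:

```python
def propose_batch(candidates, criterion, fit_model, history, k):
    """Propose k points before any real evaluation finishes ("constant liar").

    Each proposal is temporarily assigned a fake loss equal to the mean of
    the observed losses, so the refit surrogate can steer later proposals
    away from points that are already pending.
    """
    hist = list(history)   # real (x, y) observations plus accumulated lies
    lies = []
    for _ in range(k):
        model = fit_model(hist)
        pending = [p for p, _ in hist]
        pool = [x for x in candidates if x not in pending]
        x_star = min(pool, key=lambda x: criterion(x, model))
        lie = sum(y for _, y in history) / len(history)   # the constant "lie"
        hist.append((x_star, lie))
        lies.append(x_star)
    return lies   # k distinct proposals, none duplicated

# Toy check: with a criterion that just prefers small x, we still get
# k distinct proposals despite no evaluation having completed.
history = [(5, 2.0), (7, 4.0)]
pending = propose_batch(
    candidates=list(range(10)),
    criterion=lambda x, model: x,
    fit_model=lambda h: None,
    history=history,
    k=3,
)
# pending is [0, 1, 2]
```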
This makes search less efficient, though faster in terms of wall time.

[Figure 2: six panels of accuracy vs. experiment size (# trials) for the datasets mnist basic, mnist background images, mnist rotated background images, convex, rectangles, and rectangles images.]

Table 2: The test set classification error of the best model found by each search algorithm on each problem. Each search algorithm was allowed up to 200 trials. The manual searches used 82 trials for convex and 27 trials for MRBI.

              convex           MRBI
    TPE       14.13 ± 0.30%    44.55 ± 0.44%
    GP        16.70 ± 0.32%    47.08 ± 0.44%
    Manual    18.63 ± 0.34%    47.39 ± 0.44%
    Random    18.97 ± 0.34%    50.52 ± 0.44%

Figure 3: After time 30, GP optimizing the MLP hyper-parameters on the Boston Housing regression task. Best minimum found so far every 5 iterations, against time. Red = GP, Blue = Random. Shaded areas = one-sigma error bars.

Runtime per trial was limited to 1 hour of GPU computation regardless of whether execution was on a GTX 285, 470, 480, or 580. The difference in speed between the slowest and fastest machine was roughly two-fold in theory, but the actual efficiency of computation depended also on the load of the machine and the configuration of the problem (the relative speed of the different cards is different in different hyper-parameter configurations).
With the parallel evaluation of up to five proposals from the GP and TPE algorithms, each experiment took about 24 hours of wall time using five GPUs.

7 Discussion
The trajectories (H) constructed by each algorithm up to 200 steps are illustrated in Figure 4, and compared with random search and the manual search carried out in [1]. The generalization scores of the best models found using these algorithms and others are listed in Table 2. On the convex dataset (2-way classification), both algorithms converged to a validation score of 13% error. In generalization, TPE's best model had 14.1% error and GP's best had 16.7%. TPE's best was significantly better than both manual search (19%) and random search with 200 trials (17%). On the MRBI dataset (10-way classification), random search was the worst performer (50% error), the GP approach and manual search approximately tied (47% error), while the TPE algorithm found a new best result (44% error). The models found by the TPE algorithm in particular are better than previously found ones on both datasets. The GP and TPE algorithms were slightly less efficient than manual search: they identified performance on par with manual search within 80 trials, while the manual search of [1] used 82 trials for convex and 27 for MRBI.
There are several possible reasons why the TPE approach outperformed the GP approach on these two datasets. Perhaps the inverse factorization p(x|y) is more accurate than the p(y|x) of the Gaussian process. Perhaps, conversely, the exploration induced by the TPE's lack of accuracy turned out to be a good heuristic for search. Perhaps the hyper-parameters of the GP approach itself were not set to correctly trade off exploitation and exploration in the DBN configuration space. More empirical work is required to test these hypotheses.
Critically though, all four SMBO runs matched or exceeded both random search and a careful human-guided search, which are currently the state-of-the-art methods for hyper-parameter optimization.
The GP and TPE algorithms work well in both of these settings, but there are certainly settings in which these algorithms, and in fact SMBO algorithms in general, would not be expected to do well. Sequential optimization algorithms work by leveraging structure in observed (x, y) pairs. It is possible for SMBO to be arbitrarily bad with a bad choice of p(y|x). It is also possible to be slower than random sampling at finding a global optimum with an apparently good p(y|x), if it extracts structure in H that leads only to a local optimum.

8 Conclusion

This paper has introduced two sequential hyper-parameter optimization algorithms, and shown them to meet or exceed human performance and the performance of a brute-force random search in two difficult hyper-parameter optimization tasks involving DBNs. We have relaxed standard constraints (e.g. equal layer sizes at all layers) on the search space, falling back on a more natural hyper-parameter space of 32 variables (including both discrete and continuous variables) in which many

Figure 4: Efficiency of Gaussian Process-based (GP) and graphical model-based (TPE) sequential optimization algorithms on the task of optimizing the validation set performance of a DBN of up to three layers on the convex task (left) and the MRBI task (right). The dots are the elements of the trajectory H produced by each SMBO algorithm. The solid coloured lines are the validation set accuracy of the best trial found before each point in time. Both the TPE and GP algorithms make significant advances from their random initial conditions, and substantially outperform the manual and random search methods.
A\n95% con\ufb01dence interval about the best validation means on the convex task extends 0.018\nabove and below each point, and on the MRBI task extends 0.021 above and below each\npoint. The solid black line is the test set accuracy obtained by domain experts using a\ncombination of grid search and manual search [1]. The dashed line is the 99.5% quan-\ntile of validation performance found among trials sampled from our prior distribution (see\nTable 1), estimated from 457 and 361 random trials on the two datasets respectively.\n\nvariables are sometimes irrelevant, depending on the value of other parameters (e.g. the number of\nlayers). In this 32-dimensional search problem, the TPE algorithm presented here has uncovered new\nbest results on both of these datasets that are signi\ufb01cantly better than what DBNs were previously\nbelieved to achieve. Moreover, the GP and TPE algorithms are practical: the optimization for each\ndataset was done in just 24 hours using \ufb01ve GPU processors. Although our results are only for\nDBNs, our methods are quite general, and extend naturally to any hyper-parameter optimization\nproblem in which the hyper-parameters are drawn from a measurable set.\nWe hope that our work may spur researchers in the machine learning community to treat the hyper-\nparameter optimization strategy as an interesting and important component of all learning algo-\nrithms. The question of \u201cHow well does a DBN do on the convex task?\u201d is not a fully speci\ufb01ed,\nempirically answerable question \u2013 different approaches to hyper-parameter optimization will give\ndifferent answers. Algorithmic approaches to hyper-parameter optimization make machine learning\nresults easier to disseminate, reproduce, and transfer to other domains. The speci\ufb01c algorithms we\nhave presented here are also capable, at least in some cases, of \ufb01nding better results than were pre-\nviously known. 
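The sequential model-based optimization template that both presented algorithms instantiate is compact enough to sketch here. This is an illustration under simplifying assumptions, not the paper's code: the search space is a single continuous variable, and the surrogate is a deliberately naive 1-nearest-neighbour predictor standing in for the GP and TPE models; the names `smbo`, `objective`, and `sample` are ours.

```python
import random

def smbo(objective, sample, n_init=5, n_iter=25, n_cand=50, seed=0):
    """Generic SMBO loop over a history H of (x, y) observations."""
    rng = random.Random(seed)
    # Random initialization of the history H.
    H = [(x, objective(x)) for x in [sample(rng) for _ in range(n_init)]]

    def predict(x):
        # Surrogate model of p(y|x): here, simply the y of the nearest
        # previously observed x (a stand-in for the paper's GP/TPE models).
        return min(H, key=lambda p: abs(p[0] - x))[1]

    for _ in range(n_iter):
        candidates = [sample(rng) for _ in range(n_cand)]
        x_next = min(candidates, key=predict)   # most promising candidate
        H.append((x_next, objective(x_next)))   # evaluate and extend H
    return min(H, key=lambda p: p[1])           # best (x, y) found
```

Calling, say, `smbo(lambda x: (x - 3) ** 2, lambda r: r.uniform(0, 6))` concentrates trials around the minimizer at x = 3; replacing the surrogate with a proper model of p(y|x) and an acquisition criterion such as expected improvement recovers the class of algorithms studied above.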
Finally, powerful hyper-parameter optimization algorithms broaden the horizon of models that can realistically be studied; researchers need not restrict themselves to systems of a few variables that can readily be tuned by hand.

The TPE algorithm presented in this work, as well as the parallel evaluation infrastructure, is available as BSD-licensed free open-source software, which has been designed not only to reproduce the results in this work, but also to facilitate the application of these and similar algorithms to other hyper-parameter optimization problems.1

Acknowledgements

This work was supported by the National Science and Engineering Research Council of Canada, Compute Canada, and by the ANR-2010-COSI-002 grant of the French National Research Agency. GPU implementations of the DBN model were provided by Theano [23].

1 "Hyperopt" software package: https://github.com/jaberg/hyperopt

References

[1] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML 2007, pages 473-480, 2007.

[2] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.

[3] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371-3408, 2010.

[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.

[5] N. Pinto, D. Doukhan, J. J. DiCarlo, and D. D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11):e1000579, 2009.

[6] A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. NIPS Deep Learning and Unsupervised Feature Learning Workshop, 2010.

[7] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML-11), 2011.

[8] F. Hutter. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis, University of British Columbia, 2009.

[9] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In LION-5, 2011. Extended version as UBC Tech Report TR-2010-10.

[10] D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345-383, 2001.

[11] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 2006.

[12] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.

[13] J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. In L. C. W. Dixon and G. P. Szego, editors, Towards Global Optimization, volume 2, pages 117-129. North Holland, New York, 1978.

[14] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[15] D. Ginsbourger, D. Dupuy, A. Badea, L. Carraro, and O. Roustant. A note on the choice and the estimation of kriging models for the analysis of deterministic computer experiments. 25:115-131, 2009.

[16] R. Bardenet and B. Kégl. Surrogating the surrogate: accelerating Gaussian process optimization with mixtures. In ICML, 2010.

[17] P. Larrañaga and J. Lozano, editors. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Springer, 2001.

[18] N. Hansen. The CMA evolution strategy: a comparing review. In J. A. Lozano, P. Larrañaga, I. Inza, and E. Bengoetxea, editors, Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms, pages 75-102. Springer, 2006.

[19] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. The Learning Workshop (Snowbird), 2011.

[20] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.

[21] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. JMLR, 2012. Accepted.

[22] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[23] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.