{"title": "Practical Bayesian Optimization for Model Fitting with Bayesian Adaptive Direct Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1836, "page_last": 1846, "abstract": "Computational models in fields such as computational neuroscience are often evaluated via stochastic simulation or numerical approximation. Fitting these models implies a difficult optimization problem over complex, possibly noisy parameter landscapes. Bayesian optimization (BO) has been successfully applied to solving expensive black-box problems in engineering and machine learning. Here we explore whether BO can be applied as a general tool for model fitting. First, we present a novel hybrid BO algorithm, Bayesian adaptive direct search (BADS), that achieves competitive performance with an affordable computational overhead for the running time of typical models. We then perform an extensive benchmark of BADS vs. many common and state-of-the-art nonconvex, derivative-free optimizers, on a set of model-fitting problems with real data and models from six studies in behavioral, cognitive, and computational neuroscience. With default settings, BADS consistently finds comparable or better solutions than other methods, including `vanilla' BO, showing great promise for advanced BO techniques, and BADS in particular, as a general model-fitting tool.", "full_text": "Practical Bayesian Optimization for Model Fitting\n\nwith Bayesian Adaptive Direct Search\n\nLuigi Acerbi\u2217\n\nCenter for Neural Science\n\nNew York University\n\nluigi.acerbi@nyu.edu\n\nAbstract\n\nCenter for Neural Science & Dept. of Psychology\n\nWei Ji Ma\n\nNew York University\nweijima@nyu.edu\n\nComputational models in \ufb01elds such as computational neuroscience are often\nevaluated via stochastic simulation or numerical approximation. Fitting these\nmodels implies a dif\ufb01cult optimization problem over complex, possibly noisy\nparameter landscapes. 
Bayesian optimization (BO) has been successfully applied to solving expensive black-box problems in engineering and machine learning. Here we explore whether BO can be applied as a general tool for model fitting. First, we present a novel hybrid BO algorithm, Bayesian adaptive direct search (BADS), that achieves competitive performance with an affordable computational overhead for the running time of typical models. We then perform an extensive benchmark of BADS vs. many common and state-of-the-art nonconvex, derivative-free optimizers, on a set of model-fitting problems with real data and models from six studies in behavioral, cognitive, and computational neuroscience. With default settings, BADS consistently finds comparable or better solutions than other methods, including ‘vanilla’ BO, showing great promise for advanced BO techniques, and BADS in particular, as a general model-fitting tool.\n\n1 Introduction\n\nMany complex, nonlinear computational models in fields such as behavioral, cognitive, and computational neuroscience cannot be evaluated analytically, but require moderately expensive numerical approximations or simulations. In these cases, finding the maximum-likelihood (ML) solution – for parameter estimation, or model selection – requires the costly exploration of a rough or noisy nonconvex landscape, in which gradients are often unavailable to guide the search.\n\nHere we consider the problem of finding the (global) optimum x* = argmin_{x∈X} E[f(x)] of a possibly noisy objective f over a (bounded) domain X ⊆ R^D, where the function f can be intended as the (negative) log likelihood of a parameter vector x for a given dataset and model, but is generally a black box. With many derivative-free optimization algorithms available to the researcher [1], it is unclear which one should be chosen.
Crucially, an inadequate optimizer can hinder progress, limit the complexity of the models that can be fit, and even cast doubt on the reliability of one's findings.\n\nBayesian optimization (BO) is a state-of-the-art machine learning framework for optimizing expensive and possibly noisy black-box functions [2, 3, 4]. This makes it an ideal candidate for solving difficult model-fitting problems. Yet there are several obstacles to a widespread usage of BO as a general tool for model fitting. First, traditional BO methods target very costly problems, such as hyperparameter tuning [5], whereas evaluating a typical behavioral model might only have a moderate computational cost (e.g., 0.1-10 s per evaluation). This implies major differences in what is considered an acceptable algorithmic overhead, and in the maximum number of allowed function evaluations (e.g., hundreds vs. thousands).\n\n∗Current address: Département des neurosciences fondamentales, Université de Genève, CMU, 1 rue Michel-Servet, 1206 Genève, Switzerland. E-mail: luigi.acerbi@gmail.com.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nSecond, it is unclear how BO methods would fare in this regime against commonly used and state-of-the-art, non-Bayesian optimizers. Finally, BO might be perceived by non-practitioners as an advanced tool that requires specific technical knowledge to be implemented or tuned.\n\nWe address these issues by developing a novel hybrid BO algorithm, Bayesian Adaptive Direct Search (BADS), that achieves competitive performance at a small computational cost. We tested BADS, together with a wide array of commonly used optimizers, on a novel benchmark set of model-fitting problems with real data and models drawn from studies in cognitive, behavioral and computational neuroscience.
Finally, we make BADS available as a free MATLAB package with the same user\ninterface as existing optimizers and that can be used out-of-the-box with no tuning.1\nBADS is a hybrid BO method in that it combines the mesh adaptive direct search (MADS) framework\n[6] (Section 2.1) with a BO search performed via a local Gaussian process (GP) surrogate (Section\n2.2), implemented via a number of heuristics for ef\ufb01ciency (Section 3). BADS proves to be highly\ncompetitive on both arti\ufb01cial functions and real-world model-\ufb01tting problems (Section 4), showing\npromise as a general tool for model \ufb01tting in computational neuroscience and related \ufb01elds.\n\nRelated work There is a large literature about (Bayesian) optimization of expensive, possibly\nstochastic, computer simulations, mostly used in machine learning [3, 4, 5] or engineering (known\nas kriging-based optimization) [7, 8, 9]. Recent work has combined MADS with treed GP models\nfor constrained optimization (TGP-MADS [9]). Crucially, these methods have large overheads and\nmay require problem-speci\ufb01c tuning, making them impractical as a generic tool for model \ufb01tting.\nCheaper but less precise surrogate models than GPs have been proposed, such as random forests [10],\nParzen estimators [11], and dynamic trees [12]. 
In this paper, we focus on BO based on traditional GP surrogates, leaving the analysis of alternative models for future work (see Conclusions).\n\n2 Optimization frameworks\n\n2.1 Mesh adaptive direct search (MADS)\n\nThe MADS algorithm is a directional direct search framework for nonlinear optimization [6, 13]. Briefly, MADS seeks to improve the current solution by testing points in the neighborhood of the current point (the incumbent), by moving one step in each direction on an iteration-dependent mesh. In addition, the MADS framework can incorporate in the optimization any arbitrary search strategy which proposes additional test points that lie on the mesh.\n\nMADS defines the current mesh at the k-th iteration as Mk = ⋃_{x∈Sk} {x + Δmesh_k Dz : z ∈ N^{nD}}, where Sk ⊂ R^D is the set of all points evaluated since the start of the optimization, Δmesh_k ∈ R+ is the mesh size, and D is a fixed matrix in R^{D×nD} whose nD columns represent viable search directions. We choose D = [I_D, −I_D], where I_D is the identity matrix in dimension D.\n\nEach iteration of MADS comprises two stages, a SEARCH stage and an optional POLL stage. The SEARCH stage evaluates a finite number of points proposed by a provided search strategy, with the only restriction that the tested points lie on the current mesh. The search strategy is intended to inject problem-specific information in the optimization. In BADS, we exploit the freedom of SEARCH to perform Bayesian optimization in the neighborhood of the incumbent (see Section 2.2 and 3.3). The POLL stage is performed if the SEARCH fails in finding a point with an improved objective value. POLL constructs a poll set of candidate points, Pk = {xk + Δmesh_k v : v ∈ Dk}, where xk is the incumbent and Dk is the set of polling directions constructed by taking discrete linear combinations of the set of directions D. The poll size parameter Δpoll_k ≥ Δmesh_k defines the maximum length of poll displacement vectors Δmesh_k v, for v ∈ Dk (typically, Δpoll_k ≈ Δmesh_k ||v||). Points in the poll set can be evaluated in any order, and the POLL is opportunistic in that it can be stopped as soon as a better solution is found. The POLL stage ensures theoretical convergence to a local stationary point according to Clarke calculus for nonsmooth functions [6, 14].\n\nIf either SEARCH or POLL are a success, finding a mesh point with an improved objective value, the incumbent is updated and the mesh size remains the same or is multiplied by a factor τ > 1. If neither SEARCH nor POLL are successful, the incumbent does not move and the mesh size is divided by τ. The algorithm proceeds until a stopping criterion is met (e.g., maximum budget of function evaluations).\n\n1Code available at https://github.com/lacerbi/bads.\n\n2.2 Bayesian optimization\n\nThe typical form of Bayesian optimization (BO) [2] builds a Gaussian process (GP) approximation of the objective f, which is used as a relatively inexpensive surrogate to guide the search towards regions that are promising (low GP mean) and/or unknown (high GP uncertainty), according to a rule, the acquisition function, that formalizes the exploitation-exploration trade-off.\n\nGaussian processes GPs are a flexible class of models for specifying prior distributions over unknown functions f : X ⊆ R^D → R [15]. GPs are specified by a mean function m : X → R and a positive definite covariance, or kernel function k : X × X → R. Given any finite collection of n points X = {x^(i) ∈ X}_{i=1}^n, the value of f at these points is assumed to be jointly Gaussian with mean (m(x^(1)), . . . , m(x^(n)))ᵀ and covariance matrix K, where Kij = k(x^(i), x^(j)) for 1 ≤ i, j ≤ n. We assume i.i.d. Gaussian observation noise such that f evaluated at x^(i) returns y^(i) ∼ N(f(x^(i)), σ²), and y = (y^(1), . . . , y^(n))ᵀ is the vector of observed values. For a deterministic f, we still assume a small σ > 0 to improve numerical stability of the GP [16]. Conveniently, observation of such (noisy) function values will produce a GP posterior whose latent marginal conditional mean μ(x; {X, y}, θ) and variance s²(x; {X, y}, θ) at a given point are available in closed form (see Supplementary Material), where θ is a hyperparameter vector for the mean, covariance, and likelihood.
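The closed-form GP posterior mean μ and variance s² mentioned above can be computed via a Cholesky factorization of the kernel matrix. As an illustrative sketch (ours, in Python; BADS itself is released as a MATLAB package), using the ARD rational quadratic kernel the paper adopts (Eq. 1), with unit hyperparameters chosen only for the example:

```python
import math

def k_rq(x, y, sf2=1.0, alpha=1.0, ell=None):
    """ARD rational quadratic kernel: sf2 * (1 + r^2/(2*alpha))^(-alpha)."""
    ell = ell or [1.0] * len(x)
    r2 = sum((xd - yd) ** 2 / ld ** 2 for xd, yd, ld in zip(x, y, ell))
    return sf2 * (1.0 + r2 / (2.0 * alpha)) ** (-alpha)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def solve_chol(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, xstar, sigma2=1e-6, m=0.0):
    """Latent posterior mean mu(x*) and variance s2(x*) of a GP with constant mean m."""
    n = len(X)
    K = [[k_rq(X[i], X[j]) + (sigma2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha_vec = solve_chol(L, [yi - m for yi in y])   # K^{-1} (y - m)
    ks = [k_rq(xi, xstar) for xi in X]
    mu = m + sum(ki * ai for ki, ai in zip(ks, alpha_vec))
    v = solve_chol(L, ks)                             # K^{-1} k_*
    s2 = k_rq(xstar, xstar) - sum(ki * vi for ki, vi in zip(ks, v))
    return mu, max(s2, 0.0)

# Near a training input the posterior mean approaches the observation and the
# posterior variance shrinks; far away, the posterior reverts to the prior.
X = [[0.0], [0.5], [1.0]]
y = [0.0, 0.25, 1.0]
mu, s2 = gp_posterior(X, y, [0.5])
```

The small `sigma2` plays the role of the jitter the paper assumes even for deterministic objectives.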
In the following, we omit the dependency of μ and s² from the data and GP parameters to reduce clutter.\n\nCovariance functions Our main choice of stationary (translationally-invariant) covariance function is the automatic relevance determination (ARD) rational quadratic (RQ) kernel,\n\nkRQ(x, x′) = σ²_f [1 + r²(x, x′)/(2α)]^(−α),  with  r²(x, x′) = Σ_{d=1}^D (x_d − x′_d)²/ℓ²_d,   (1)\n\nwhere σ²_f is the signal variance, ℓ1, . . . , ℓD are the kernel length scales along each coordinate direction, and α > 0 is the shape parameter. More common choices for Bayesian optimization include the squared exponential (SE) kernel [9] or the twice-differentiable ARD Matérn 5/2 (M5/2) kernel [5], but we found the RQ kernel to work best in combination with our method (see Section 4.2). We also consider composite periodic kernels for circular or periodic variables (see Supplementary Material).\n\nAcquisition function For a given GP approximation of f, the acquisition function, a : X → R, determines which point in X should be evaluated next via a proxy optimization xnext = argmin_x a(x). We consider here the GP lower confidence bound (LCB) metric [17],\n\naLCB(x; {X, y}, θ) = μ(x) − √(ν βt s²(x)),  with  βt = 2 ln(Dt²π²/(6δ)),   (2)\n\nwhere ν > 0 is a tunable parameter, t is the number of function evaluations so far, δ > 0 is a probabilistic tolerance, and βt is a learning rate chosen to minimize cumulative regret under certain assumptions. For BADS we use the recommended values ν = 0.2 and δ = 0.1 [17]. Another popular choice is the (negative) expected improvement (EI) over the current best function value [18], and a historical, less used metric is the (negative) probability of improvement (PI) [19].\n\n3 Bayesian adaptive direct search (BADS)\n\nWe describe here the main steps of BADS (Algorithm 1). Briefly, BADS alternates between a series of fast, local BO steps (the SEARCH stage of MADS) and a systematic, slower exploration of the mesh grid (POLL stage). The two stages complement each other, in that the SEARCH can explore the space very effectively, provided an adequate surrogate model. When the SEARCH repeatedly fails, meaning that the GP model is not helping the optimization (e.g., due to a misspecified model, or excess uncertainty), BADS switches to POLL. The POLL stage performs a fail-safe, model-free optimization, during which BADS gathers information about the local shape of the objective function, so as to build a better surrogate for the next SEARCH. This alternation makes BADS able to deal effectively and robustly with a variety of problems.
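The LCB rule of Eq. 2 is simple to evaluate once the GP posterior is available. A minimal sketch (illustrative Python, not the BADS implementation; the toy posterior values below are hypothetical):

```python
import math

def lcb(mu, s2, t, D, nu=0.2, delta=0.1):
    """GP lower confidence bound (Eq. 2): mu(x) - sqrt(nu * beta_t * s2(x)),
    with beta_t = 2 ln(D t^2 pi^2 / (6 delta))."""
    beta_t = 2.0 * math.log(D * t ** 2 * math.pi ** 2 / (6.0 * delta))
    return [m - math.sqrt(nu * beta_t * v) for m, v in zip(mu, s2)]

# Toy posterior over three candidate points: the second has the lowest mean,
# but the third's large uncertainty makes it the most promising under LCB.
mu = [1.0, 0.8, 1.1]
s2 = [0.01, 0.01, 1.0]
a = lcb(mu, s2, t=10, D=2)
x_next = min(range(len(a)), key=a.__getitem__)
```

This illustrates the exploitation-exploration trade-off: high posterior variance lowers the bound and can outweigh a slightly worse mean.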
See Supplementary Material for a full description.\n\n3.1 Initial setup\n\nProblem specification The algorithm is initialized by providing a starting point x0, vectors of hard lower/upper bounds LB, UB, and optional vectors of plausible lower/upper bounds PLB, PUB, with the\n\nAlgorithm 1 Bayesian Adaptive Direct Search\nInput: objective function f, starting point x0, hard bounds LB, UB, (optional: plausible bounds PLB, PUB, barrier function c, additional options)\n1: Initialization: Δmesh_0 ← 2^(−10), Δpoll_0 ← 1, k ← 0, evaluate f on initial design ▷ Section 3.1\n2: repeat\n3:   (update GP approximation at any step; refit hyperparameters if necessary) ▷ Section 3.2\n4:   for 1 . . . nsearch do ▷ SEARCH stage, Section 3.3\n5:     xsearch ← SEARCHORACLE ▷ local Bayesian optimization step\n6:     Evaluate f on xsearch, if improvement is sufficient then break\n7:   if SEARCH is NOT successful then ▷ optional POLL stage, Section 3.3\n8:     compute poll set Pk\n9:     evaluate opportunistically f on Pk sorted by acquisition function\n10:  if iteration k is successful then\n11:    update incumbent xk+1\n12:    if POLL was successful then Δmesh_k ← 2Δmesh_k, Δpoll_k ← 2Δpoll_k\n13:  else\n14:    Δmesh_k ← ½Δmesh_k, Δpoll_k ← ½Δpoll_k\n15:  k ← k + 1\n16: until fevals > MaxFunEvals or Δpoll_k < 10^(−6) or stalling ▷ stopping criteria\n17: return xend = argmin_k f(xk) (or xend = argmin_k qβ(xk) for noisy objectives, Section 3.4)\n\nrequirement that for each dimension 1 ≤ d ≤ D, LBd ≤ PLBd < PUBd ≤ UBd.2 Plausible bounds identify a region in parameter space where most solutions are expected to lie. Hard upper/lower bounds can be infinite, but plausible bounds need to be finite.
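The control flow of Algorithm 1 can be sketched as follows (an illustrative Python skeleton under simplifying assumptions: a stub random-proposal oracle in place of the GP-based SEARCH, a plain coordinate POLL, a single step-size parameter, and no mesh rounding):

```python
import random

def bads_skeleton(f, x0, max_fun_evals=500, n_search=4, seed=0):
    """Schematic MADS-style loop of Algorithm 1: alternate a SEARCH step
    (stub oracle) with an opportunistic coordinate POLL; double the poll
    size after a successful POLL, halve it after a failed iteration."""
    rng = random.Random(seed)
    D = len(x0)
    x, fx = list(x0), f(x0)
    poll_size, evals = 1.0, 1
    while evals < max_fun_evals and poll_size >= 1e-6:
        search_success = poll_success = False
        # SEARCH stage (stub oracle: Gaussian proposals around the incumbent)
        for _ in range(n_search):
            cand = [xi + rng.gauss(0.0, poll_size) for xi in x]
            fc = f(cand)
            evals += 1
            if fc < fx - poll_size ** 1.5:    # sufficient improvement (Section 3.3)
                x, fx, search_success = cand, fc, True
                break
        if not search_success:
            # POLL stage: +/- coordinate steps, stopped at first improvement
            for d in range(D):
                for sign in (+1.0, -1.0):
                    cand = list(x)
                    cand[d] += sign * poll_size
                    fc = f(cand)
                    evals += 1
                    if fc < fx:
                        x, fx, poll_success = cand, fc, True
                        break
                if poll_success:
                    break
        if poll_success:
            poll_size *= 2.0
        elif not search_success:
            poll_size /= 2.0
    return x, fx

# Minimize a smooth quadratic with optimum at (1, -2); the skeleton should
# get close to the optimum well within the evaluation budget.
xbest, fbest = bads_skeleton(lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2,
                             [0.0, 0.0])
```

The real algorithm additionally maintains the mesh/poll size distinction, a GP surrogate, and the bound handling of Section 3.1.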
Problem variables whose hard bounds\nare strictly positive and UBd \u2265 10 \u00b7 LBd are automatically converted to log space. All variables\nare then linearly rescaled to the standardized box [\u22121, 1]D such that the box bounds correspond\nto [PLB, PUB] in the original space. BADS supports bound or no constraints, and optionally other\nconstraints via a provided barrier function c (see Supplementary Material). The user can also specify\ncircular or periodic dimensions (such as angles); and whether the objective f is deterministic or noisy\n(stochastic), and in the latter case provide a coarse estimate of the noise (see Section 3.4).\nInitial design The initial design consists of the provided starting point x0 and ninit = D additional\npoints chosen via a space-\ufb01lling quasi-random Sobol sequence [20] in the standardized box, and\nforced to lie on the mesh grid. If the user does not specify whether f is deterministic or stochastic,\nthe algorithm assesses it by performing two consecutive evaluations at x0.\n\n3.2 GP model in BADS\nThe default GP model is speci\ufb01ed by a constant mean function m \u2208 R, a smooth ARD RQ kernel\n(Eq. 1), and we use aLCB (Eq. 2) as a default acquisition function.\nHyperparameters The default GP has hyperparameters \u03b8 = ((cid:96)1, . . . , (cid:96)D, \u03c32\nf , \u03b1, \u03c32, m). We\nimpose an empirical Bayes prior on the GP hyperparameters based on the current training set\n(see Supplementary Material), and select \u03b8 via maximum a posteriori (MAP) estimation. We \ufb01t \u03b8\nvia a gradient-based nonlinear optimizer, starting from either the previous value of \u03b8 or a weighted\ndraw from the prior, as a means to escape local optima. 
We refit the hyperparameters every 2D to 5D function evaluations; more often earlier in the optimization, and whenever the current GP is particularly inaccurate at predicting new points, according to a normality test on the residuals, z^(i) = (y^(i) − μ(x^(i)))/√(s²(x^(i)) + σ²) (assumed independent, in first approximation).\n\nTraining set The GP training set X consists of a subset of the points evaluated so far (the cache), selected to build a local approximation of the objective in the neighborhood of the incumbent xk, constructed as follows. Each time X is rebuilt, points in the cache are sorted by their ℓ-scaled distance r² (Eq. 1) from xk. First, the closest nmin = 50 points are automatically added to X. Second, up to 10D additional points with r ≤ 3ρ(α) are included in the set, where ρ(α) ≳ 1 is a radius function that depends on the decay of the kernel. For the RQ kernel, ρRQ(α) = √α · √(e^(1/α) − 1) (see Supplementary Material). Newly evaluated points are added incrementally to the set, using fast rank-one updates of the GP posterior. The training set is rebuilt any time the incumbent is moved.\n\n2A variable d can be fixed by setting (x0)d = LBd = UBd = PLBd = PUBd. Fixed variables become constants, and BADS runs on an optimization problem with reduced dimensionality.\n\n3.3 Implementation of the MADS framework\n\nWe initialize Δpoll_0 = 1 and Δmesh_0 = 2^(−10) (in standardized space), such that the initial poll steps can span the plausible region, whereas the mesh grid is relatively fine. We use τ = 2, and increase the mesh size only after a successful POLL. We skip the POLL after a successful SEARCH.\n\nSearch stage We apply an aggressive, repeated SEARCH strategy that consists of up to nsearch = max{D, ⌊3 + D/2⌋} unsuccessful SEARCH steps.
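The training-set construction just described can be sketched as follows (illustrative Python; the toy values `n_min = 2` and `n_extra = 4` stand in for the paper's nmin = 50 and 10D, and are chosen only so the 1-D example is readable):

```python
import math

def rho_rq(alpha):
    """Radius function for the RQ kernel: sqrt(alpha) * sqrt(exp(1/alpha) - 1)."""
    return math.sqrt(alpha) * math.sqrt(math.exp(1.0 / alpha) - 1.0)

def select_training_set(cache, x_inc, ell, alpha=1.0, n_min=2, n_extra=4):
    """Pick the n_min cache points closest to the incumbent in ell-scaled
    distance, plus up to n_extra further points within radius 3*rho(alpha)."""
    def r(x):
        return math.sqrt(sum((xd - cd) ** 2 / ld ** 2
                             for xd, cd, ld in zip(x, x_inc, ell)))
    ranked = sorted(cache, key=r)
    train = ranked[:n_min]
    extra = [x for x in ranked[n_min:] if r(x) <= 3.0 * rho_rq(alpha)]
    return train + extra[:n_extra]

# 1-D toy cache around incumbent 0; with alpha = 1, 3*rho(alpha) is about 3.9,
# so the point at distance 10 is excluded even though n_extra would allow it.
cache = [[0.1], [-0.2], [1.5], [3.0], [10.0]]
train = select_training_set(cache, [0.0], ell=[1.0])
```

The effect is a local model: distant cache points neither slow down nor distort the GP fit around the incumbent.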
In each step, we use a search oracle, based on a local BO with the current GP, to produce a search point xsearch (see below). We evaluate f(xsearch) and add it to the training set. If the improvement in objective value is none or insufficient, that is less than (Δpoll_k)^(3/2), we continue searching, or switch to POLL after nsearch steps. Otherwise, we call it a success and start a new SEARCH from scratch, centered on the updated incumbent.\n\nSearch oracle We choose xsearch via a fast, approximate optimization inspired by CMA-ES [21]. We sample batches of points in the neighborhood of the incumbent xk, drawn ∼ N(xs, λ²(Δpoll_k)²Σ), where xs is the current search focus, Σ a search covariance matrix, and λ > 0 a scaling factor, and we pick the point that optimizes the acquisition function (see Supplementary Material). We remove from the SEARCH set candidate points that violate non-bound constraints (c(x) > 0), and we project candidate points that fall outside hard bounds to the closest mesh point inside the bounds. Across SEARCH steps, we use both a diagonal matrix Σℓ with diagonal (ℓ1²/|ℓ|², . . . , ℓD²/|ℓ|²), and a matrix ΣWCM proportional to the weighted covariance matrix of points in X (each point weighted according to a function of its ranking in terms of objective values yi). We choose between Σℓ and ΣWCM probabilistically via a hedge strategy, based on their track record of cumulative improvement [22].\n\nPoll stage We incorporate the GP approximation in the POLL in two ways: when constructing the set of polling directions Dk, and when choosing the polling order. We generate Dk according to the random LTMADS algorithm [6], but then rescale each vector coordinate 1 ≤ d ≤ D proportionally to the GP length scale ℓd (see Supplementary Material). We discard poll vectors that do not satisfy the given bound or nonbound constraints.
Second, since the POLL is opportunistic, we evaluate points in the poll set according to the ranking given by the acquisition function [9].\n\nStopping criteria We stop the optimization when the poll size Δpoll_k goes below a threshold (default 10^(−6)); when reaching a maximum number of objective evaluations (default 500D); or if there is no significant improvement of the objective for more than 4 + ⌊D/2⌋ iterations. The algorithm returns the optimum xend (transformed back to original coordinates) with the lowest objective value yend.\n\n3.4 Noisy objective\n\nIn case of a noisy objective, we assume for the noise a hyperprior ln σ ∼ N(ln σest, 1), with σest a base noise magnitude (default σest = 1, but the user can provide an estimate). To account for additional uncertainty, we also make the following changes: double the minimum number of points added to the training set, nmin = 100, and increase the maximum number to 200; increase the initial design to ninit = 20; and double the number of allowed stalled iterations before stopping.\n\nUncertainty handling Due to noise, we cannot simply use the output values yi as ground truth in the SEARCH and POLL stages. Instead, we replace yi with the GP latent quantile function [23]\n\nqβ(x; {X, y}, θ) ≡ qβ(x) = μ(x) + Φ⁻¹(β)s(x),  β ∈ [0.5, 1),   (3)\n\nwhere Φ⁻¹(·) is the quantile function of the standard normal (plugin approach [24]). Moreover, we modify the MADS procedure by keeping an incumbent set {xi}_{i=1}^k, where xi is the incumbent at the end of the i-th iteration. At the end of each POLL we re-evaluate qβ for all elements of the incumbent set, in light of the new points added to the cache. We select as current (active) incumbent the point with lowest qβ(xi). During optimization we set β = 0.5 (mean prediction only), which promotes exploration.
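A minimal sketch of the quantile function of Eq. 3 (illustrative Python; the bisection-based normal quantile is only to keep the example self-contained):

```python
import math

def norm_ppf(beta):
    """Quantile function of the standard normal, via bisection on the
    erf-based CDF; adequate for illustration."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def q_beta(mu, s, beta=0.5):
    """GP latent quantile function (Eq. 3): q_beta(x) = mu(x) + Phi^{-1}(beta) * s(x)."""
    return mu + norm_ppf(beta) * s

# beta = 0.5 reduces to the posterior mean; larger beta penalizes uncertain
# points, which is why a conservative beta is used to pick the final optimum.
```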
We use a conservative βend = 0.999 for the last iteration, to select the optimum xend returned by the algorithm in a robust manner. Instead of yend, we return either μ(xend) or an unbiased estimate of E[f(xend)] obtained by averaging multiple evaluations (see Supplementary Material).\n\n4 Experiments\n\nWe tested BADS and many optimizers with implementation available in MATLAB (R2015b, R2017a) on a large set of artificial and real optimization problems (see Supplementary Material for details).\n\n4.1 Design of the benchmark\n\nAlgorithms Besides BADS, we tested 16 optimization algorithms, including popular choices such as Nelder-Mead (fminsearch [25]), several constrained nonlinear optimizers in the fmincon function (default interior-point [26], sequential quadratic programming sqp [27], and active-set actset [28]), genetic algorithms (ga [29]), random search (randsearch) as a baseline [30]; and also less-known state-of-the-art methods for nonconvex derivative-free optimization [1], such as Multilevel Coordinate Search (MCS [31]) and CMA-ES [21, 32] (cmaes, in different flavors). For noisy objectives, we included algorithms that explicitly handle uncertainty, such as snobfit [33] and noisy CMA-ES [34]. Finally, to verify the advantage of BADS' hybrid approach to BO, we also tested a standard, ‘vanilla’ version of BO [5] (bayesopt, R2017a) on the set of real model-fitting problems (see below). For all algorithms, including BADS, we used default settings (no fine-tuning).\n\nProblem sets First, we considered a standard benchmark set of artificial, noiseless functions (BBOB09 [35], 24 functions) in dimensions D ∈ {3, 6, 10, 15}, for a total of 96 test functions. We also created ‘noisy’ versions of the same set. Second, we collected model-fitting problems from six published or ongoing studies in cognitive and computational neuroscience (CCN17).
The objectives\nof the CCN17 set are negative log likelihood functions of an input parameter vector, for speci\ufb01ed\ndatasets and models, and can be deterministic or stochastic. For each study in the CCN17 set we\nasked its authors for six different real datasets (i.e., subjects or neurons), divided between one or two\nmain models of interest; collecting a total of 36 test functions with D \u2208 {6, 9, 10, 12, 13}.\nProcedure We ran 50 independent runs of each algorithm on each test function, with randomized\nstarting points and a budget of 500 \u00d7 D function evaluations (200 \u00d7 D for noisy problems). If an\nalgorithm terminated before depleting the budget, it was restarted from a new random point. We\nconsider a run successful if the current best (or returned, for noisy problems) function value is within a\ngiven error tolerance \u03b5 > 0 from the true optimum fmin (or our best estimate thereof).3 For noiseless\nproblems, we compute the fraction of successful runs as a function of number of objective evaluations,\naveraged over datasets/functions and over \u03b5 \u2208 [0.01, 10] (log spaced). This is a realistic range for \u03b5,\nas differences in log likelihood below 0.01 are irrelevant for model selection; an acceptable tolerance\nis \u03b5 \u223c 0.5 (a difference in deviance, the metric used for AIC or BIC, less than 1); larger \u03b5 associate\nwith coarse solutions, but errors larger than 10 would induce excessive biases in model selection. 
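The success measure described above can be sketched as follows (illustrative Python; the grid size and the example run values are hypothetical):

```python
import math

def fraction_solved(best_values, f_min, eps_range=(0.01, 10.0), n_eps=24):
    """Fraction of runs whose best function value lies within tolerance eps of
    the true optimum, averaged over log-spaced tolerances (Section 4.1)."""
    lo, hi = (math.log(e) for e in eps_range)
    eps_grid = [math.exp(lo + (hi - lo) * i / (n_eps - 1)) for i in range(n_eps)]
    fracs = []
    for eps in eps_grid:
        ok = sum(1 for v in best_values if v - f_min <= eps)
        fracs.append(ok / len(best_values))
    return sum(fracs) / n_eps

# Three runs of a hypothetical optimizer, with errors 0.005, 0.5, and 100:
# the first counts at every tolerance, the last at none.
score = fraction_solved([0.005, 0.5, 100.0], f_min=0.0)
```

Averaging over log-spaced tolerances rewards optimizers that are good across the whole range of precisions relevant for model selection, not just at one threshold.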
For\nnoisy problems, what matters most is the solution xend that the algorithm actually returns, which,\ndepending on the algorithm, may not necessarily be the point with the lowest observed function value.\nSince, unlike the noiseless case, we generally do not know the solutions that would be returned by any\nalgorithm at every time step, but only at the last step, we plot instead the fraction of successful runs\nat 200 \u00d7 D function evaluations as a function of \u03b5, for \u03b5 \u2208 [0.1, 10] (noise makes higher precisions\nmoot), and averaged over datasets/functions. In all plots we omit error bars for clarity (standard errors\nwould be about the size of the line markers or less).\n\n4.2 Results on arti\ufb01cial functions (BBOB09)\n\nThe BBOB09 noiseless set [35] comprises of 24 functions divided in 5 groups with different properties:\nseparable; low or moderate conditioning; unimodal with high conditioning; multi-modal with adequate\n/ with weak global structure. First, we use this benchmark to show the performance of different\ncon\ufb01gurations for BADS. Note that we selected the default con\ufb01guration (RQ kernel, aLCB) and\nother algorithmic details by testing on a different benchmark set (see Supplementary Material). Fig 1\n(left) shows aggregate results across all noiseless functions with D \u2208 {3, 6, 10, 15}, for alternative\nchoices of kernels and acquisition functions (only a subset is shown, such as the popular M5/2, EI\ncombination), or by altering other features (such as setting nsearch = 1, or \ufb01xing the search covariance\nmatrix to \u03a3(cid:96) or \u03a3WCM). Almost all changes from the default con\ufb01guration worsen performance.\n\n3Note that the error tolerance \u03b5 is not a fractional error, as sometimes reported in optimization, because for\n\nmodel comparison we typically care about (absolute) differences in log likelihoods.\n\n6\n\n\fFigure 1: Arti\ufb01cial test functions (BBOB09). 
Left & middle: Noiseless functions. Fraction of successful runs (ε ∈ [0.01, 10]) vs. # function evaluations per # dimensions, for D ∈ {3, 6, 10, 15} (96 test functions); for different BADS configurations (left) and all algorithms (middle). Right: Heteroskedastic noise. Fraction of successful runs at 200 × D objective evaluations vs. tolerance ε.\n\nNoiseless functions We then compared BADS to other algorithms (Fig 1 middle). Depending on the number of function evaluations, the best optimizers are BADS, methods of the fmincon family, and, for large budget of function evaluations, CMA-ES with active update of the covariance matrix.\n\nNoisy functions We produce noisy versions of the BBOB09 set by adding i.i.d. Gaussian observation noise at each function evaluation, y^(i) = f(x^(i)) + σ(x^(i))η^(i), with η^(i) ∼ N(0, 1). We consider a variant with moderate homoskedastic (constant) noise (σ = 1), and a variant with heteroskedastic noise with σ(x) = 1 + 0.1 × (f(x) − fmin), which follows the observation that variability generally increases for solutions away from the optimum. For many functions in the BBOB09 set, this heteroskedastic noise can become substantial (σ ≫ 10) away from the optimum. Fig 1 (right) shows aggregate results for the heteroskedastic set (homoskedastic results are similar).
BADS outperforms\nall other optimizers, with CMA-ES (active, with or without the noisy option) coming second.\nNotably, BADS performs well even on problems with non-stationary (location-dependent) features,\nsuch as heteroskedastic noise, thanks to its local GP approximation.\n\n4.3 Results on real model-\ufb01tting problems (CCN17)\n\nThe objectives of the CCN17 set are deterministic (e.g., computed via numerical approximation) for\nthree studies (Fig 2), and noisy (e.g., evaluated via simulation) for the other three (Fig 3).\nThe algorithmic cost of BADS is \u223c 0.03 s to 0.15 s per function evaluation, depending on D, mostly\ndue to the re\ufb01tting of the GP hyperparameters. This produces a non-negligible overhead, de\ufb01ned as\n100%\u00d7 (total optimization time / total function time \u22121). For a fair comparison with other methods\nwith little or no overhead, for deterministic problems we also plot the effective performance of BADS\nby accounting for the extra cost per function evaluation. In practice, this correction shifts rightward\nthe performance curve of BADS in log-iteration space, since each function evaluation with BADS has\nan increased fractional time cost. For stochastic problems, we cannot compute effective performance\nas easily, but there we found small overheads (< 5%), due to more costly evaluations (more than 1 s).\nFor a direct comparison with standard BO, we also tested on the CCN17 set a \u2018vanilla\u2019 BO algorithm,\nas implemented in MATLAB R2017a (bayesopt). This implementation closely follows [5], with\noptimization instead of marginalization over GP hyperparameters. Due to the fast-growing cost of\nBO as a function of training set size, we allowed up to 300 training points for the GP, restarting the\nBO algorithm from scratch with a different initial design every 300 BO iterations (until the total\nbudget of function evaluations was exhausted). 
The choice of 300 iterations already produced a large average algorithmic overhead of ∼ 8 s per function evaluation. In showing the results of bayesopt, we display raw performance without penalizing for the overhead.

Figure 2: Real model-fitting problems (CCN17, deterministic). Fraction of successful runs (ε ∈ [0.01, 10]) vs. # function evaluations per # dimensions. Left: Causal inference in visuo-vestibular perception [36] (6 subjects, D = 10). Middle: Bayesian confidence in perceptual categorization [37] (6 subjects, D = 13). Right: Neural model of orientation selectivity [38] (6 neurons, D = 12).

Causal inference in visuo-vestibular perception Causal inference (CI) in perception is the process whereby the brain decides whether to integrate or segregate multisensory cues that could arise from the same or from different sources [39]. This study investigates CI in visuo-vestibular heading perception across tasks and under different levels of visual reliability, via a factorial model comparison [36]. For our benchmark we fit three subjects with a Bayesian CI model (D = 10), and another three with a fixed-criterion CI model (D = 10) that disregards visual reliability.
Both models include heading-dependent likelihoods and marginalization of the decision variable over the latent space of noisy sensory measurements (x_vis, x_vest), solved via nested numerical integration in 1-D and 2-D.

Bayesian confidence in perceptual categorization This study investigates the Bayesian confidence hypothesis that subjective judgments of confidence are directly related to the posterior probability the observer assigns to a learnt perceptual category [37] (e.g., whether the orientation of a drifting Gabor patch belongs to a 'narrow' or to a 'wide' category). For our benchmark we fit six subjects to the 'Ultrastrong' Bayesian confidence model (D = 13), which uses the same mapping between posterior probability and confidence across two tasks with different distributions of stimuli. This model includes a latent noisy decision variable, marginalized over via 1-D numerical integration.

Neural model of orientation selectivity The authors of this study explore the origins of diversity of neuronal orientation selectivity in visual cortex via novel stimuli (orientation mixtures) and modeling [38]. We fit the responses of five V1 cells and one V2 cell with the authors' neuronal model (D = 12), which combines effects of filtering, suppression, and response nonlinearity [38]. The model has one circular parameter, the preferred direction of motion of the neuron. The model is analytical but still computationally expensive due to large datasets and a cascade of several nonlinear operations.

Word recognition memory This study models a word recognition task in which subjects rated their confidence that a presented word was in a previously studied list [40] (data from [41]). We consider six subjects divided between two normative models, the 'Retrieving Effectively from Memory' model [42] (D = 9) and a similar, novel model4 (D = 6).
Both models use Monte Carlo methods to draw random samples from a large space of latent noisy memories, yielding a stochastic log likelihood.

Target detection and localization This study looks at differences in observers' decision-making strategies in target detection ('was the target present?') and localization ('which one was the target?') with displays of 2, 3, 4, or 6 oriented Gabor patches.5 Here we fit six subjects with a previously derived ideal observer model [43, 44] (D = 6) with variable-precision noise [45], assuming shared parameters between detection and localization. The log likelihood is evaluated via simulation, due to marginalization over latent noisy measurements of stimulus orientations with variable precision.

Combinatorial board game playing This study analyzes people's strategies in a four-in-a-row game played on a 4-by-9 board against human opponents ([46], Experiment 1). We fit the data of six players with the main model (D = 10), which is based on a Best-First exploration of a decision tree guided by a feature-based value heuristic. The model also includes feature dropping, value noise, and lapses, to better capture human variability. Model evaluation is computationally expensive due to the construction and evaluation of trees of future board states, and is achieved via inverse binomial sampling, an unbiased stochastic estimator of the log likelihood [46]. Due to prohibitive computational costs, here we only test major algorithms (MCS is the method used in the paper [46]); see Fig 3 (right).

4 Unpublished; upcoming work from Aspen H. Yoo and Wei Ji Ma.
5 Unpublished; upcoming work from Andra Mihali and Wei Ji Ma.

Figure 3: Real model-fitting problems (CCN17, noisy). Fraction of successful runs at 200 × D objective evaluations vs. tolerance ε. Left: Confidence in word recognition memory [40] (6 subjects, D = 6, 9). Middle: Target detection and localization [44] (6 subjects, D = 6). Right: Combinatorial board game playing [46] (6 subjects, D = 10).

In all problems, BADS consistently performs on par with or outperforms all other tested optimizers, even when accounting for its extra algorithmic cost. The second-best algorithm is either some flavor of CMA-ES or, for some deterministic problems, a member of the fmincon family. Crucially, their
Crucially, their\nranking across problems is inconsistent, with both CMA-ES and fmincon performing occasionally\nquite poorly (e.g., fmincon does poorly in the causal inference set because of small \ufb02uctuations\nin the log likelihood landscape caused by coarse numerical integration). Interestingly, vanilla BO\n(bayesopt) performs poorly on all problems, often at the level of random search, and always\nsubstantially worse than BADS, even without accounting for the much larger overhead of bayesopt.\nThe solutions found by bayesopt are often hundreds (even thousands) points of log likelihood from\nthe optimum. This failure is possibly due to the dif\ufb01culty of building a global GP surrogate for BO,\ncoupled with strong non-stationarity of the log likelihood functions; and might be ameliorated by more\ncomplex forms of BO (e.g., input warping to produce nonstationary kernels [47], hyperparameter\nmarginalization [5]). However, these advanced approaches would substantially increase the already\nlarge overhead. Importantly, we expect this poor perfomance to extend to any package which\nimplements vanilla BO (such as BayesOpt [48]), regardless of the ef\ufb01ciency of implementation.\n\n5 Conclusions\n\nWe have developed a novel BO method and an associated toolbox, BADS, with the goal of \ufb01tting\nmoderately expensive computational models out-of-the-box. We have shown on real model-\ufb01tting\nproblems that BADS outperforms widely used and state-of-the-art methods for nonconvex, derivative-\nfree optimization, including \u2018vanilla\u2019 BO. We attribute the robust performance of BADS to the\nalternation between the aggressive SEARCH strategy, based on local BO, and the failsafe POLL stage,\nwhich protects against failures of the GP surrogate \u2013 whereas vanilla BO does not have such failsafe\nmechanisms, and can be strongly affected by model misspeci\ufb01cation. 
Our results demonstrate that a hybrid Bayesian approach to optimization can be beneficial beyond the domain of very costly black-box functions, in line with recent advancements in probabilistic numerics [49].

Like other surrogate-based methods, the performance of BADS is linked to its ability to obtain a fast approximation of the objective, which generally deteriorates in high dimensions, or for functions with pathological structure (often improvable via reparameterization). From our tests, we recommend BADS, paired with some multi-start optimization strategy, for models with up to ∼ 15 variables, a noisy or jagged log likelihood landscape, and when algorithmic overhead is ≲ 75% (e.g., model evaluation ≳ 0.1 s). Future work with BADS will focus on testing alternative statistical surrogates instead of GPs [12]; combining it with a smart multi-start method for global optimization; providing support for tunable precision of noisy observations [23]; improving the numerical implementation; and recasting some of its heuristics in terms of approximate inference.
Acknowledgments

We thank Will Adler, Robbe Goris, Andra Mihali, Bas van Opheusden, and Aspen Yoo for sharing data and model evaluation code that we used in the CCN17 benchmark set; Maija Honig, Andra Mihali, Bas van Opheusden, and Aspen Yoo for providing user feedback on earlier versions of the bads package for MATLAB; Will Adler, Andra Mihali, Bas van Opheusden, and Aspen Yoo for helpful feedback on a previous version of this manuscript; John Wixted and colleagues for allowing us to reuse their data for the CCN17 'word recognition memory' problem set; and the three anonymous reviewers for useful feedback. This work has utilized the NYU IT High Performance Computing resources and services.

References

[1] Rios, L. M. & Sahinidis, N. V. (2013) Derivative-free optimization: A review of algorithms and comparison of software implementations. Journal of Global Optimization 56, 1247–1293.

[2] Jones, D. R., Schonlau, M., & Welch, W. J. (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, 455–492.

[3] Brochu, E., Cora, V. M., & De Freitas, N. (2010) A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.

[4] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & de Freitas, N. (2016) Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104, 148–175.

[5] Snoek, J., Larochelle, H., & Adams, R. P. (2012) Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems 24, 2951–2959.

[6] Audet, C. & Dennis Jr, J. E.
(2006) Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization 17, 188–217.

[7] Taddy, M. A., Lee, H. K., Gray, G. A., & Griffin, J. D. (2009) Bayesian guided pattern search for robust local optimization. Technometrics 51, 389–401.

[8] Picheny, V. & Ginsbourger, D. (2014) Noisy kriging-based optimization methods: A unified implementation within the DiceOptim package. Computational Statistics & Data Analysis 71, 1035–1053.

[9] Gramacy, R. B. & Le Digabel, S. (2015) The mesh adaptive direct search algorithm with treed Gaussian process surrogates. Pacific Journal of Optimization 11, 419–447.

[10] Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011) Sequential model-based optimization for general algorithm configuration. LION 5, 507–523.

[11] Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011) Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems 24, 2546–2554.

[12] Talgorn, B., Le Digabel, S., & Kokkolaras, M. (2015) Statistical surrogate formulations for simulation-based design optimization. Journal of Mechanical Design 137, 021405-1–021405-18.

[13] Audet, C., Custódio, A., & Dennis Jr, J. E. (2008) Erratum: Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization 18, 1501–1503.

[14] Clarke, F. H. (1983) Optimization and Nonsmooth Analysis. (John Wiley & Sons, New York).

[15] Rasmussen, C. & Williams, C. K. I. (2006) Gaussian Processes for Machine Learning. (MIT Press).

[16] Gramacy, R. B. & Lee, H. K. (2012) Cases for the nugget in modeling computer experiments. Statistics and Computing 22, 713–722.

[17] Srinivas, N., Krause, A., Seeger, M., & Kakade, S. M. (2010) Gaussian process optimization in the bandit setting: No regret and experimental design. ICML-10 pp. 1015–1022.

[18] Mockus, J., Tiesis, V., & Zilinskas, A.
(1978) in Towards Global Optimisation. (North-Holland, Amsterdam), pp. 117–129.

[19] Kushner, H. J. (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering 86, 97–106.

[20] Bratley, P. & Fox, B. L. (1988) Algorithm 659: Implementing Sobol's quasirandom sequence generator. ACM Transactions on Mathematical Software (TOMS) 14, 88–100.

[21] Hansen, N., Müller, S. D., & Koumoutsakos, P. (2003) Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation 11, 1–18.

[22] Hoffman, M. D., Brochu, E., & de Freitas, N. (2011) Portfolio allocation for Bayesian optimization. Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence pp. 327–336.

[23] Picheny, V., Ginsbourger, D., Richet, Y., & Caplin, G. (2013) Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics 55, 2–13.

[24] Picheny, V., Wagner, T., & Ginsbourger, D. (2013) A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization 48, 607–626.

[25] Lagarias, J. C., Reeds, J. A., Wright, M. H., & Wright, P. E. (1998) Convergence properties of the Nelder–Mead simplex method in low dimensions. SIAM Journal on Optimization 9, 112–147.

[26] Waltz, R. A., Morales, J. L., Nocedal, J., & Orban, D. (2006) An interior algorithm for nonlinear optimization that combines line search and trust region steps. Mathematical Programming 107, 391–408.

[27] Nocedal, J. & Wright, S. (2006) Numerical Optimization, Springer Series in Operations Research. (Springer Verlag), 2nd edition.

[28] Gill, P. E., Murray, W., & Wright, M. H. (1981) Practical Optimization. (Academic Press).

[29] Goldberg, D. E.
(1989) Genetic Algorithms in Search, Optimization & Machine Learning. (Addison-Wesley).

[30] Bergstra, J. & Bengio, Y. (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, 281–305.

[31] Huyer, W. & Neumaier, A. (1999) Global optimization by multilevel coordinate search. Journal of Global Optimization 14, 331–355.

[32] Jastrebski, G. A. & Arnold, D. V. (2006) Improving evolution strategies through active covariance matrix adaptation. IEEE Congress on Evolutionary Computation (CEC 2006), pp. 2814–2821.

[33] Csendes, T., Pál, L., Sendin, J. O. H., & Banga, J. R. (2008) The GLOBAL optimization method revisited. Optimization Letters 2, 445–454.

[34] Hansen, N., Niederberger, A. S., Guzzella, L., & Koumoutsakos, P. (2009) A method for handling uncertainty in evolutionary optimization with an application to feedback control of combustion. IEEE Transactions on Evolutionary Computation 13, 180–197.

[35] Hansen, N., Finck, S., Ros, R., & Auger, A. (2009) Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions.

[36] Acerbi, L., Dokka, K., Angelaki, D. E., & Ma, W. J. (2017) Bayesian comparison of explicit and implicit causal inference strategies in multisensory heading perception. bioRxiv preprint bioRxiv:150052.

[37] Adler, W. T. & Ma, W. J. (2017) Human confidence reports account for sensory uncertainty but in a non-Bayesian way. bioRxiv preprint bioRxiv:093203.

[38] Goris, R. L., Simoncelli, E. P., & Movshon, J. A. (2015) Origin and function of tuning diversity in macaque visual cortex. Neuron 88, 819–831.

[39] Körding, K. P., Beierholm, U., Ma, W. J., Quartz, S., Tenenbaum, J. B., & Shams, L. (2007) Causal inference in multisensory perception. PLoS One 2, e943.

[40] van den Berg, R., Yoo, A. H., & Ma, W. J.
(2017) Fechner's law in metacognition: A quantitative model of visual working memory confidence. Psychological Review 124, 197–214.

[41] Mickes, L., Wixted, J. T., & Wais, P. E. (2007) A direct test of the unequal-variance signal detection model of recognition memory. Psychonomic Bulletin & Review 14, 858–865.

[42] Shiffrin, R. M. & Steyvers, M. (1997) A model for recognition memory: REM—retrieving effectively from memory. Psychonomic Bulletin & Review 4, 145–166.

[43] Ma, W. J., Navalpakkam, V., Beck, J. M., van den Berg, R., & Pouget, A. (2011) Behavior and neural basis of near-optimal visual search. Nature Neuroscience 14, 783–790.

[44] Mazyar, H., van den Berg, R., & Ma, W. J. (2012) Does precision decrease with set size? Journal of Vision 12, 1–10.

[45] van den Berg, R., Shin, H., Chou, W.-C., George, R., & Ma, W. J. (2012) Variability in encoding precision accounts for visual short-term memory limitations. Proceedings of the National Academy of Sciences USA 109, 8780–8785.

[46] van Opheusden, B., Bnaya, Z., Galbiati, G., & Ma, W. J. (2016) Do people think like computers? International Conference on Computers and Games pp. 212–224.

[47] Snoek, J., Swersky, K., Zemel, R., & Adams, R. (2014) Input warping for Bayesian optimization of non-stationary functions. ICML-14 pp. 1674–1682.

[48] Martinez-Cantin, R. (2014) BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. Journal of Machine Learning Research 15, 3735–3739.

[49] Hennig, P., Osborne, M. A., & Girolami, M. (2015) Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A 471, 20150142.