{"title": "A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function", "book": "Advances in Neural Information Processing Systems", "page_first": 3005, "page_last": 3013, "abstract": "We propose a novel Bayesian approach to solve stochastic optimization problems that involve \ufb01nding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of \ufb01rst, doing inference over the function space and second, \ufb01nding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior where the natural parameter corresponds to a given kernel function and the suf\ufb01cient statistic is composed of the observed function values. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function.", "full_text": "A Nonparametric Conjugate Prior Distribution for\n\nthe Maximizing Argument of a Noisy Function\n\nPedro A. Ortega\n\nMax Planck Institute for Intelligent Systems\nMax Planck Institute for Biolog. Cybernetics\npedro.ortega@tuebingen.mpg.de\n\nJordi Grau-Moya\n\nMax Planck Institute for Intelligent Systems\nMax Planck Institute for Biolog. Cybernetics\n\njordi.grau@tuebingen.mpg.de\n\nTim Genewein\n\nMax Planck Institute for Intelligent Systems\nMax Planck Institute for Biolog. Cybernetics\ntim.genewein@tuebingen.mpg.de\n\nDavid Balduzzi\n\nMax Planck Institute for Intelligent Systems\ndavid.balduzzi@tuebingen.mpg.de\n\nDaniel A. Braun\n\nMax Planck Institute for Intelligent Systems\nMax Planck Institute for Biolog. Cybernetics\ndaniel.braun@tuebingen.mpg.de\n\nAbstract\n\nWe propose a novel Bayesian approach to solve stochastic optimization problems\nthat involve \ufb01nding extrema of noisy, nonlinear functions. 
Previous work has\nfocused on representing possible functions explicitly, which leads to a two-step\nprocedure of first, doing inference over the function space and second, finding\nthe extrema of these functions. Here we skip the representation step and directly\nmodel the distribution over extrema. To this end, we devise a non-parametric\nconjugate prior based on a kernel regressor. The resulting posterior distribution\ndirectly captures the uncertainty over the maximum of the unknown function.\nGiven t observations of the function, the posterior can be evaluated efficiently\nin time O(t^2) up to a multiplicative constant. Finally, we show how to apply our\nmodel to optimize a noisy, non-convex, high-dimensional objective function.\n\n1 Introduction\n\nHistorically, the fields of statistical inference and stochastic optimization have often developed their\nown specific methods and approaches. Recently, however, there has been a growing interest in\napplying inference-based methods to optimization problems and vice versa [1\u20134]. Here we consider\nstochastic optimization problems where we observe noise-contaminated values from an unknown\nnonlinear function and we want to find the input that maximizes the expected value of this function.\n\nThe problem statement is as follows. Let X be a metric space. Consider a stochastic function\nf : X \u2192 R mapping a test point x \u2208 X to real values y \u2208 R characterized by the conditional pdf\nP (y|x). 
Consider the mean function\n\n$$\\bar{f}(x) := \\mathbb{E}[y|x] = \\int y \\, P(y|x) \\, dy. \\qquad (1)$$\n\nThe goal consists in modeling the optimal test point\n\n$$x^* := \\arg\\max_x \\{ \\bar{f}(x) \\}. \\qquad (2)$$\n\n[Figure 1: panels a) and b)]\n\nFigure 1: a) Given an estimate h of the mean function \u00aff (left), a simple probability density function\nover the location of the maximum x\u2217 is obtained using the transformation P (x\u2217) \u221d exp{\u03b1h(x\u2217)},\nwhere \u03b1 > 0 plays the role of the precision (right). b) Illustration of the Gramian matrix for different\ntest locations. Locations that are close to each other produce large off-diagonal entries.\n\nClassic approaches to solve this problem are often based on stochastic approximation methods [5].\nWithin the context of statistical inference, Bayesian optimization methods have been developed\nwhere a prior distribution over the space of functions is assumed and uncertainty is tracked during\nthe entire optimization process [6, 7]. In particular, non-parametric Bayesian approaches such as\nGaussian Processes have been applied for derivative-free optimization [8, 9], also within the context\nof the continuum-armed bandit problem [10]. Typically, these Bayesian approaches aim to explicitly\nrepresent the unknown objective function of (1) by entertaining a posterior distribution over the\nspace of objective functions. In contrast, we aim to model directly the distribution of the maximum\nof (2) conditioned on observations.\n\n2 Brief Description\n\nOur model is intuitively straightforward and easy to implement (see footnote 1). Let h(x) : X \u2192 R be an estimate\nof the mean \u00aff (x) constructed from data $D_t := \\{(x_i, y_i)\\}_{i=1}^t$ (Figure 1a, left). 
This estimate can\neasily be converted into a posterior pdf over the location of the maximum by first multiplying it with\na precision parameter \u03b1 > 0 and then taking the normalized exponential (Figure 1a, right)\n\nP (x\u2217|Dt) \u221d exp{\u03b1 \u00b7 h(x\u2217)}.\n\nIn this transformation, the precision parameter \u03b1 controls the certainty we have over our estimate of\nthe maximizing argument: \u03b1 \u2248 0 expresses almost no certainty, while \u03b1 \u2192 \u221e expresses certainty.\nThe rationale for the precision is: the more distinct inputs we test, the higher the precision. Testing\nthe same (or similar) inputs only provides local information and therefore should not increase our\nknowledge about the global maximum. A simple and effective way of implementing this idea is\ngiven by\n\n$$P(x^*|D_t) \\propto \\exp\\Big\\{ \\rho \\cdot \\underbrace{\\Big( \\xi + t \\cdot \\frac{\\sum_i K(x_i, x_i)}{\\sum_i \\sum_j K(x_i, x_j)} \\Big)}_{\\text{effective \\# of locations}} \\cdot \\underbrace{\\frac{\\sum_i K(x_i, x^*) y_i + K_0(x^*) y_0(x^*)}{\\sum_i K(x_i, x^*) + K_0(x^*)}}_{\\text{estimate of } \\bar{f}(x^*)} \\Big\\}, \\qquad (3)$$\n\nwhere \u03c1, \u03be, K, K0 and y0 are parameters of the estimator: \u03c1 > 0 is the precision we gain for each\nnew distinct observation; \u03be > 0 is the number of prior points; K : X \u00d7 X \u2192 R+ is a symmetric\nkernel function; K0 : X \u2192 R+ is a prior precision function; and y0 : X \u2192 R is a prior estimate of\n\u00aff .\nIn (3), the mean function \u00aff is estimated with a kernel regressor [11] that combines the function\nobservations with a prior estimate of the function, and the total effective number of locations is\ncalculated as the sum of the prior locations \u03be and the number of distinct locations in the data Dt.\nThe latter is estimated by multiplying the number of data points t with the coefficient\n\n$$\\frac{\\sum_i K(x_i, x_i)}{\\sum_i \\sum_j K(x_i, x_j)} \\in (0, 1], \\qquad (4)$$\n\ni.e. the ratio between the trace of the Gramian matrix $(K(x_i, x_j))_{i,j}$ and the sum of its entries.\nInputs that are very close to each other will have overlapping kernels, resulting in large off-diagonal\nentries of the Gramian matrix, hence decreasing the number of distinct locations (Figure 1b). For\nexample, if we have t observations from n \u226a t locations, and each location has t/n observations,\nthen the coefficient (4) is equal to n/t and hence the number of distinct locations is exactly n, as\nexpected.\n\nFootnote 1: Implementations can be downloaded from http://www.adaptiveagents.org/argmaxprior\n\n[Figure 2 panels: Noisy Function; 10 Data Points; 100 Data Points; 1000 Data Points]\n\nFigure 2: Illustration of the posterior distribution over the maximizing argument for 10, 100 and\n1000 observations drawn from a function with varying noise. The top-left panel illustrates the function\nand the variance bounds (one standard deviation). The observations in the center region close\nto x = 1.5 are very noisy. It can be seen that the prior gets progressively washed out with more\nobservations.\n\nFigure 2 illustrates the behavior of the posterior distribution. The expression for the posterior can\nbe calculated up to a constant factor in O(t^2) time. The computation of the normalizing constant\nis in general intractable. 
Since Markov chain Monte Carlo (MCMC) methods such as Metropolis-Hastings only need the density up to a constant,\nour proposed posterior can nevertheless be easily combined with them to implement stochastic optimizers, as will be illustrated in\nSection 4.\n\n3 Derivation\n\n3.1 Function-Based, Indirect Model\n\nOur first task is to derive an indirect Bayesian model for the optimal test point that builds its estimate\nvia the underlying function space. Let G be the set of hypotheses, and assume that each hypothesis\ng \u2208 G corresponds to a stochastic mapping g : X \u2192 R. Let P (g) be the prior (see footnote 2) over G and let the\nlikelihood be $P(\\{y_t\\}|g, \\{x_t\\}) = \\prod_t P(y_t|g, x_t)$. Then, the posterior of g is given by\n\n$$P(g|\\{y_t\\}, \\{x_t\\}) = \\frac{P(g) P(\\{y_t\\}|g, \\{x_t\\})}{P(\\{y_t\\}|\\{x_t\\})} = \\frac{P(g) \\prod_t P(y_t|g, x_t)}{P(\\{y_t\\}|\\{x_t\\})}. \\qquad (5)$$\n\nFor each x\u2217 \u2208 X , let G(x\u2217) \u2282 G be the subset of functions such that for all g \u2208 G(x\u2217),\n$x^* = \\arg\\max_x \\{\\bar{g}(x)\\}$ (see footnote 3). Then, the posterior over the optimal test point x\u2217 is given by\n\n$$P(x^*|\\{y_t\\}, \\{x_t\\}) = \\int_{G(x^*)} P(g|\\{y_t\\}, \\{x_t\\}) \\, dg. \\qquad (6)$$\n\nThis model has two important drawbacks: (a) it relies on modeling the entire function space G,\nwhich is potentially much more complex than necessary; (b) it requires calculating the integral (6),\nwhich is intractable for virtually all real-world problems.\n\nFootnote 2: For the sake of simplicity, we neglect issues of measurability of G.\nFootnote 3: Note that we assume that the mean function \u00afg is bounded and that it has a unique maximizing test point.\n\n3.2 Domain-Based, Direct Model\n\nWe want to arrive at a Bayesian model that bypasses the integration step suggested by (6) and directly\nmodels the location of the optimal test point x\u2217. The following theorem explains how this direct model\nrelates to the previous model.\nTheorem 1. 
The Bayesian model for the optimal test point x\u2217 is given by\n\n$$P(x^*) = \\int_{G(x^*)} P(g) \\, dg \\qquad \\text{(prior)}$$\n\n$$P(y_t|x^*, x_t, D_{t-1}) = \\frac{\\int_{G(x^*)} P(y_t|g, x_t) P(g) \\prod_{k=1}^{t-1} P(y_k|g, x_k) \\, dg}{\\int_{G(x^*)} P(g) \\prod_{k=1}^{t-1} P(y_k|g, x_k) \\, dg}, \\qquad \\text{(likelihood)}$$\n\nwhere $D_t := \\{(x_k, y_k)\\}_{k=1}^t$ is the set of past tests.\n\nProof. Using Bayes\u2019 rule, the posterior distribution P (x\u2217|{yt}, {xt}) can be rewritten as\n\n$$\\frac{P(x^*) \\prod_t P(y_t|x^*, x_t, D_{t-1})}{P(\\{y_t\\}|\\{x_t\\})}. \\qquad (7)$$\n\nSince this posterior is equal to (6), one concludes (using (5)) that\n\n$$P(x^*) \\prod_t P(y_t|x^*, x_t, D_{t-1}) = \\int_{G(x^*)} P(g) \\prod_t P(y_t|g, x_t) \\, dg.$$\n\nNote that this expression corresponds to the joint P (x\u2217, {yt}|{xt}). The prior P (x\u2217) is obtained by\nsetting t = 0. The likelihood is obtained as the fraction\n\n$$P(y_t|x^*, x_t, D_{t-1}) = \\frac{P(x^*, \\{y_k\\}_{k=1}^{t} | \\{x_k\\}_{k=1}^{t})}{P(x^*, \\{y_k\\}_{k=1}^{t-1} | \\{x_k\\}_{k=1}^{t-1})},$$\n\nwhere it shall be noted that the denominator $P(x^*, \\{y_k\\}_{k=1}^{t-1} | \\{x_k\\}_{k=1}^{t-1})$ does not change if we add the\ncondition xt.\n\nFrom Theorem 1 it is seen that although the likelihood model P (yt|g, xt) for the indirect model\nis i.i.d. at each test point, the likelihood model P (yt|x\u2217, xt, Dt\u22121) for the direct model depends\non the past tests Dt\u22121, that is, it is adaptive. More critically though, the internal structure of the\ndirect model\u2019s likelihood function corresponds to an integration over function space as well,\nthus inheriting all the difficulties of the indirect model.\n\n3.3 Abstract Properties of the Likelihood Function\n\nThere is a way to bypass modeling the function space explicitly if we make a few additional\nassumptions. We assume that for any g \u2208 G(x\u2217), the mean function \u00afg is continuous and has a unique\nmaximum. 
Then, the crucial insight consists in realizing that the value of the mean function \u00afg inside\na sufficiently small neighborhood of x\u2217 is larger than the value outside of it (see Figure 3a).\nFor any \u03b4 > 0 and any z \u2208 X , let B\u03b4(z) denote the open \u03b4-ball centered on z. We assume that the\nfunctions in G fulfill the following properties:\n\na. Continuous: Every function g \u2208 G is such that its mean \u00afg is continuous and bounded.\nb. Maximum: For any x\u2217 \u2208 X , the functions g \u2208 G(x\u2217) are such that for all \u03b4 > 0 and all\nz \u2209 B\u03b4(x\u2217), \u00afg(x\u2217) > \u00afg(z).\n\n[Figure 3: panels a), b) and c)]\n\nFigure 3: Illustration of assumptions. a) Three functions from G(x\u2217). They all have their maximum\nlocated at x\u2217 \u2208 X . b) Schematic representation of the likelihood function of x\u2217 \u2208 X conditioned\non a few observations. The curve corresponds to the mean and the shaded area to the confidence\nbounds. The density inside of the neighborhood is unique to the hypothesis x\u2217, while the density\noutside is shared amongst all the hypotheses. c) The log-likelihood ratio of the hypotheses $x^*_1$ and\n$x^*_2$ as a function of the test point x. The kernel used in the plot is Gaussian.\n\nFurthermore, we impose a symmetry condition on the likelihood function. Let $x^*_1$ and $x^*_2$ be in X ,\nand consider their associated equivalence classes $G(x^*_1)$ and $G(x^*_2)$. There is no reason for them to\nbe very different: in fact, they should virtually be indistinguishable outside of the neighborhoods\nof $x^*_1$ and $x^*_2$. It is only inside of the neighborhood of $x^*_1$ that $G(x^*_1)$ becomes distinguishable from\nthe other equivalence classes, because the functions in $G(x^*_1)$ systematically predict higher values\nthan the rest. This assumption is illustrated in Figure 3b. 
In fact, taking the log-likelihood ratio of\ntwo competing hypotheses\n\n$$\\log \\frac{P(y_t|x^*_1, x_t, D_{t-1})}{P(y_t|x^*_2, x_t, D_{t-1})}$$\n\nfor a given test location xt should give a value equal to zero unless xt is inside of the vicinity of $x^*_1$\nor $x^*_2$ (see Figure 3c). In other words, the amount of evidence a hypothesis gets when the test point\nis outside of its neighborhood is essentially zero (i.e. it is the same as the amount of evidence that\nmost of the other hypotheses get).\n\n3.4 Likelihood and Conjugate Prior\n\nFollowing our previous discussion, we propose the following likelihood model. Given the previous\ndata Dt\u22121 and a test point xt \u2208 X , the likelihood of the observation yt is\n\n$$P(y_t|x^*, x_t, D_{t-1}) = \\frac{1}{Z(x_t, D_{t-1})} \\lambda(y_t|x_t, D_{t-1}) \\exp\\big\\{ \\alpha_t \\cdot h_t(x^*) - \\alpha_{t-1} \\cdot h_{t-1}(x^*) \\big\\}, \\qquad (8)$$\n\nwhere: Z(xt, Dt\u22121) is a normalizing constant; \u03bb(yt|xt, Dt\u22121) is a posterior probability over yt\ngiven xt and the data Dt\u22121; \u03b1t is a precision measuring the knowledge we have about the whole\nfunction; and ht is an estimate of the mean function \u00aff. We have chosen the precision \u03b1t as\n\n$$\\alpha_t := \\rho \\cdot \\Big( \\xi + t \\cdot \\frac{\\sum_i K(x_i, x_i)}{\\sum_i \\sum_j K(x_i, x_j)} \\Big),$$\n\nwhere \u03c1 > 0 is a scaling parameter; \u03be > 0 is a parameter representing the number of prior locations\ntested; and K : X \u00d7 X \u2192 R+ is a symmetric kernel function (see footnote 4). For the estimate ht, we have chosen\na Nadaraya-Watson kernel regressor [11]\n\n$$h_t(x^*) := \\frac{\\sum_{i=1}^{t} K(x_i, x^*) y_i + K_0(x^*) y_0(x^*)}{\\sum_{i=1}^{t} K(x_i, x^*) + K_0(x^*)}.$$\n\nIn the last expression, y0 corresponds to a prior estimate of \u00aff with prior precision K0. Inspecting (8),\nwe see that the likelihood model favours positive changes to the estimated mean function from new,\nunseen test locations. 
The pdf \u03bb(yt|xt, Dt\u22121) does not need to be explicitly defined, as it will later\ndrop out when computing the posterior. The only formal requirement is that it should be independent\nof the hypothesis x\u2217.\nWe propose the conjugate prior\n\n$$P(x^*) = \\frac{1}{Z_0} \\exp\\{ \\alpha_0 \\cdot h_0(x^*) \\} = \\frac{1}{Z_0} \\exp\\{ \\rho \\cdot \\xi \\cdot y_0(x^*) \\}, \\qquad (9)$$\n\nwhere \u03b10 = \u03c1 \u00b7 \u03be and h0 = y0 are the precision and the estimate before any data is observed.\n\nFootnote 4: We refer the reader to the kernel regression literature for an analysis of the choice of kernel functions.\n\nThe conjugate prior just encodes a prior estimate of the mean function. In a practical optimization\napplication, it serves the purpose of guiding the exploration of the domain, as locations x\u2217 with high\nprior value y0(x\u2217) are more likely to contain the maximizing argument.\nGiven a set of data points Dt, the prior (9) and the likelihood (8) lead to a posterior given by\n\n$$\\begin{aligned} P(x^*|D_t) &= \\frac{P(x^*) \\prod_{k=1}^{t} P(y_k|x^*, x_k, D_{k-1})}{\\int_X P(x') \\prod_{k=1}^{t} P(y_k|x', x_k, D_{k-1}) \\, dx'} \\\\ &= \\frac{\\exp\\big\\{ \\alpha_0 h_0(x^*) + \\sum_{k=1}^{t} [\\alpha_k h_k(x^*) - \\alpha_{k-1} h_{k-1}(x^*)] \\big\\} Z_0^{-1} \\prod_{k=1}^{t} Z(x_k, D_{k-1})^{-1}}{\\int_X \\exp\\big\\{ \\alpha_0 h_0(x') + \\sum_{k=1}^{t} [\\alpha_k h_k(x') - \\alpha_{k-1} h_{k-1}(x')] \\big\\} Z_0^{-1} \\prod_{k=1}^{t} Z(x_k, D_{k-1})^{-1} \\, dx'} \\\\ &= \\frac{\\exp\\{ \\alpha_t \\cdot h_t(x^*) \\}}{\\int_X \\exp\\{ \\alpha_t \\cdot h_t(x') \\} \\, dx'}. \\end{aligned} \\qquad (10)$$\n\nHere the factors \u03bb(yk|xk, Dk\u22121) and the normalizers cancel between numerator and denominator, and the sum in the exponent telescopes.\nThus, the particular choice of the likelihood function guarantees an analytically compact posterior\nexpression. In general, the normalizing constant in (10) is intractable, which is why the expression is\nonly practical for relative comparisons of test locations. 
Substituting the precision \u03b1t and the mean\nfunction estimate ht yields\n\n$$P(x^*|D_t) \\propto \\exp\\Big\\{ \\rho \\cdot \\Big( \\xi + t \\cdot \\frac{\\sum_i K(x_i, x_i)}{\\sum_i \\sum_j K(x_i, x_j)} \\Big) \\cdot \\frac{\\sum_i K(x_i, x^*) y_i + K_0(x^*) y_0(x^*)}{\\sum_i K(x_i, x^*) + K_0(x^*)} \\Big\\}.$$\n\n4 Experimental Results\n\n4.1 Parameters.\n\nWe have investigated the influence of the parameters on the resulting posterior probability distribution.\nWe have used the Gaussian kernel\n\n$$K(x, x^*) = \\exp\\Big\\{ -\\frac{1}{2\\sigma^2} (x - x^*)^2 \\Big\\}. \\qquad (11)$$\n\nIn Figure 4, 7 data points are shown, which were drawn as y \u223c N (f (x), 0.3), where the mean\nfunction is\n\n$$f(x) = \\cos(2x + \\tfrac{3}{2}\\pi) + \\sin(6x + \\tfrac{3}{2}\\pi). \\qquad (12)$$\n\nThe prior precision K0 and the prior estimate of the mean function y0 were chosen as\n\n$$K_0(x) = 1 \\quad \\text{and} \\quad y_0(x) = -\\frac{1}{2\\sigma_0^2} (x - \\mu_0)^2, \\qquad (13)$$\n\nwhere the latter corresponds to the logarithm of a Gaussian with mean \u00b50 = 1.5 and variance\n$\\sigma_0^2 = 5$. This prior favours the region close to \u00b50.\nFigure 4 shows how the choice of the precision scale \u03c1 and the kernel width \u03c3 affect the shape of\nthe posterior probability density. Here, it is seen that a larger kernel width \u03c3 increases the region of\ninfluence of a particular data point, and hence produces smoother posterior densities. The precision\nscale parameter \u03c1 controls the precision per distinct data point: higher values for \u03c1 lead to sharper\nupdates of the posterior distribution.\n\n4.2 Application to Optimization.\n\nThe main motivation behind our proposed model is its application to the optimization of noisy\nfunctions. Because of the noise, choosing new test locations requires carefully balancing explorative\nand exploitative tests, a problem well known in the multi-armed bandit literature. 
To address\nthis, one can apply the Bayesian control rule/Thompson sampling [12, 13]: the next test location\nis chosen by sampling it from the posterior. We have carried out two experiments, described in the\nfollowing.\n\n[Figure 4: panels a), b) and c)]\n\nFigure 4: Effect of the change of parameters on the posterior density over the location of the\nmaximizing test point. Panel (a) shows the 7 data points drawn from the noisy function (solid curve).\nPanel (b) shows the effect of increasing the width of the kernel (here, Gaussian). The solid and\ndotted curves correspond to \u03c3 = 0.01 and \u03c3 = 0.1 respectively. Panel (c) shows the effect of\ndiminishing the precision on the posterior, where solid and shaded curves correspond to \u03c1 = 0.2 and\n\u03c1 = 0.1 respectively.\n\n[Figure 5 panels: Average Value (left); Average Value (right); vertical axis: y obs; horizontal axis: # of samples]\n\nFigure 5: Observation values obtained by sampling from the posterior over the maximizing argument\n(left panel) and according to GP-UCB (right panel). The solid blue curve corresponds to the\ntime-averaged function value, averaged over ten runs. The gray area corresponds to the error bounds\n(1 standard deviation), and the dashed curve in red shows the time-average of a single run.\n\nComparison to Gaussian Process UCB. We have used the model to optimize the same\nfunction (12) as in our preliminary tests but with higher additive noise equal to one. This is done by\nsampling the next test point xt directly from the posterior density over the optimum location P (x\u2217|Dt),\nand then using the resulting pair (xt, yt) to recursively update the model. 
Essentially, this procedure\ncorresponds to Bayesian control rule/Thompson sampling.\n\nWe compared our method against a Gaussian Process optimization method using an upper\nconfidence bound (UCB) criterion [10]. The parameters for the GP-UCB were set to the following\nvalues: observation noise \u03c3n = 0.3 and length scale \u2113 = 0.3. For the constant that trades off\nexploration and exploitation we followed Theorem 1 in [10] which states $\\beta_t = 2 \\log(|D| t^2 \\pi^2 / (6\\delta))$\nwith \u03b4 = 0.5. We have implemented our proposed method with a Gaussian kernel as in (11) with\nwidth $\\sigma^2 = 0.05$. The prior sufficient statistics are exactly as in (13). The precision parameter was\nset to \u03c1 = 0.3.\nSimulation results over ten independent runs are summarized in Figure 5. We show the\ntime-averaged observation values y of the noisy function evaluated at test locations sampled from the\nposterior. Qualitatively, both methods show very similar convergence (on average); however, our\nmethod converges faster and with a slightly higher variance.\n\nHigh-Dimensional Problem. To test our proposed method on a challenging problem, we have\ndesigned a non-convex, high-dimensional noisy function with multiple local optima. This Noisy\nRipples function is defined as\n\n$$f(x) = -\\frac{1}{1000} \\|x - \\mu\\|^2 + \\cos\\big( \\tfrac{2}{3} \\pi \\|x - \\mu\\| \\big),$$\n\nwhere \u00b5 \u2208 X is the location of the global maximum, and where observations have additive Gaussian\nnoise with zero mean and variance 0.1. The advantage of this function is that it generalizes well to\nany number of dimensions of the domain. 
Figure 6a illustrates the function for the 2-dimensional\ninput domain. This function is difficult to optimize because it requires averaging the noisy\nobservations and smoothing the ridged landscape in order to detect the underlying quadratic form.\n\n[Figure 6 panels: a) surface plot; b) Average Value and Regret; horizontal axis: Samples]\n\nFigure 6: a) The Noisy Ripples objective function in 2 dimensions. b) The time-averaged value and\nthe regret obtained by the optimization algorithm on a 50-dimensional version of the Noisy Ripples\nfunction.\n\nWe optimized the 50-dimensional version of this function using a Metropolis-Hastings scheme to\nsample the next test locations from the posterior over the maximizing argument. The Markov chain\nwas started at $[20, 20, \\ldots, 20]^T$, executing 120 isotropic Gaussian steps of variance 0.07 before\nthe point was used as an actual test location. For the arg-max prior, we used a Gaussian kernel\nwith lengthscale l = 2, precision factor \u03c1 = 1.5, prior precision K0(x\u2217) = 1 and prior mean\nestimate $y_0(x^*) = -\\frac{2}{1000} \\|x^* + 5\\|^2$. The goal \u00b5 was located at the origin.\nThe result of one run is presented in Figure 6b. It can be seen that the optimizer manages to quickly\n(\u2248 100 samples) reach near-optimal performance, overcoming the difficulties associated with the\nhigh-dimensionality of the input space and the numerous local optima. Crucial for this success was\nthe choice of a kernel that is wide enough to accurately estimate the mean function. The authors are\nnot aware of any method capable of solving a problem of similar characteristics.\n\n5 Conclusions\n\nOur goal was to design a probabilistic model over the maximizing argument that is algorithmically\nefficient and statistically robust even for large, high-dimensional noisy functions. To this end, we\nhave derived a Bayesian model that directly captures the uncertainty over the maximizing argument,\nthereby bypassing having to model the underlying function space, which is a much harder problem.\n\nOur proposed model is computationally very efficient when compared to Gaussian process-based\nmodels (which have cubic time complexity) or models based on upper confidence bounds (which require\nfinding the input maximizing the bound, a generally intractable operation). In our model,\nevaluating the posterior up to a constant factor scales quadratically with the size of the data.\n\nIn practice, we have found that one of the main difficulties associated with our proposed method is\nthe choice of the parameters. As in any kernel-based estimation method, choosing the appropriate\nkernel bandwidth can significantly change the estimate and affect the performance of optimizers that\nrely on the model. There is no clear rule on how to choose a good bandwidth.\n\nIn future research, it will be interesting to investigate the theoretical properties of the proposed\nnonparametric model, such as the convergence speed of the estimator and its relation to the extensive\nliterature on active learning and bandits.\n\nReferences\n\n[1] E. Brochu, V. Cora, and N. de Freitas. A tutorial on bayesian optimization of expensive cost\nfunctions, with application to active user modeling and hierarchical reinforcement learning.\nTechnical Report TR-2009-023, University of British Columbia, Department of Computer Science,\n2009.\n\n[2] K. Rawlik, M. Toussaint, and S. Vijayakumar. Approximate inference and stochastic optimal\ncontrol. arXiv:1009.3958, 2010.\n\n[3] A. Shapiro. 
Probabilistic Constrained Optimization: Methodology and Applications, chapter\nStatistical Inference of Stochastic Optimization Problems, pages 282\u2013304. Kluwer Academic\nPublishers, 2000.\n\n[4] H.J. Kappen, V. G\u00f3mez, and M. Opper. Optimal control as a graphical model inference problem.\nMachine Learning, 87(2):159\u2013182, 2012.\n\n[5] H.J. Kushner and G.G. Yin. Stochastic Approximation Algorithms and Applications.\nSpringer-Verlag, 1997.\n\n[6] J. Mockus. Application of bayesian approach to numerical methods of global and stochastic\noptimization. Journal of Global Optimization, 4(4):347\u2013365, 1994.\n\n[7] D. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, 2008.\n\n[8] D.R. Jones, M. Schonlau, and W.J. Welch. Efficient global optimization of expensive black-box\nfunctions. Journal of Global Optimization, 13(4):455\u2013492, 1998.\n\n[9] M.A. Osborne, R. Garnett, and S.J. Roberts. Gaussian processes for global optimization. In\n3rd International Conference on Learning and Intelligent Optimization (LION3), 2009.\n\n[10] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit\nsetting: No regret and experimental design. In International Conference on Machine Learning,\n2010.\n\n[11] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,\nsecond edition, 2009.\n\n[12] P.A. Ortega and D.A. Braun. A minimum relative entropy principle for learning and acting.\nJournal of Artificial Intelligence Research, 38:475\u2013511, 2010.\n\n[13] B.C. May and D.S. Leslie. Simulation studies in optimistic Bayesian sampling in\ncontextual-bandit problems. 
Technical Report 11:02, Statistics Group, Department of Mathematics,\nUniversity of Bristol, 2011.\n", "award": [], "sourceid": 1362, "authors": [{"given_name": "Pedro", "family_name": "Ortega", "institution": null}, {"given_name": "Jordi", "family_name": "Grau-moya", "institution": null}, {"given_name": "Tim", "family_name": "Genewein", "institution": null}, {"given_name": "David", "family_name": "Balduzzi", "institution": null}, {"given_name": "Daniel", "family_name": "Braun", "institution": null}]}