Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

Advances in Neural Information Processing Systems, pages 865–872.

Andrea Lecchini-Visintini
Department of Engineering
University of Leicester, UK
alv1@leicester.ac.uk

John Lygeros
Automatic Control Laboratory
ETH Zurich, Switzerland
lygeros@control.ee.ethz.ch

Jan Maciejowski
Department of Engineering
University of Cambridge, UK
jmm@eng.cam.ac.uk

Abstract

Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory.

Optimization is the general problem of finding a value of a vector of variables θ that maximizes (or minimizes) some scalar criterion U(θ). The set of all possible values of the vector θ is called the optimization domain. The elements of θ can be discrete or continuous variables. In the first case the optimization domain is usually finite, such as in the well-known traveling salesman problem; in the second case the optimization domain is a continuous set.
An important example of a continuous optimization domain is the set of 3-D configurations of a sequence of amino-acids in the problem of finding the minimum energy folding of the corresponding protein [1].

In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than 10^155 possible tours. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found and such problems remain reliably solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3, 4, 5, 6]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains have been evaluated in many works, e.g. [7, 8, 9, 10].

On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers, with the notable exception of convex criteria, for which convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains.
We will show that it is possible to derive simulated annealing algorithms which, with an arbitrarily high level of confidence, find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance to the global optimal solution, after a known finite number of steps. Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [12, 13, 14, 15].

The background of our work is twofold. On the one hand, our notion of approximate solution to a global optimization problem is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory [16, 17]. We maintain an important aspect of statistical learning theory, namely that we do not introduce any particular assumption on the optimization criterion, i.e. our results hold regardless of what U is. On the other hand, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and Markov chain Monte Carlo (MCMC) methods, which has been one of the main achievements of recent research in statistics [18, 19, 20, 21].

In this paper, we will not develop any ready-to-use optimization algorithm. We will instead introduce a general formulation of the simulated annealing method which allows one to derive new simulated annealing algorithms with rigorous finite-time guarantees on the basis of existing theory. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom.
The choice and comparison of specific algorithms go beyond the scope of the paper.

The paper is organized in the following sections. In Simulated annealing we introduce the method and fix the notation. In Convergence we recall the reasons why finite-time guarantees for simulated annealing on continuous domains have not been obtained before. In Finite-time guarantees we present the main result of the paper. In Conclusions we state our findings and conclude the paper.

1 Simulated annealing

The original formulation of simulated annealing was inspired by the analogy between the stochastic evolution of the thermodynamic state of an annealing material towards the configurations of minimal energy and the search for the global minimum of an optimization criterion [3]. In the procedure, the optimization criterion plays the role of the energy and the state of the annealed material is simulated by the evolution of the state of an inhomogeneous Markov chain. The state of the chain evolves according to the Metropolis-Hastings algorithm in order to simulate the Boltzmann distribution of thermodynamic equilibrium. The Boltzmann distribution is simulated for a decreasing sequence of temperatures ("cooling"). The target distribution of the cooling procedure is the limiting Boltzmann distribution, for the temperature that tends to zero, which takes non-zero values only on the set of global minimizers [7].

The original formulation of the method was for a finite domain. However, simulated annealing can be generalized straightforwardly to a continuous domain because the Metropolis-Hastings algorithm can be used with almost no differences on discrete and continuous domains. The main difference is that on a continuous domain the equilibrium distributions are specified by probability densities.
On a continuous domain, Markov transition kernels in which the distribution of the elements visited by the chain converges to an equilibrium distribution with the desired density can be constructed using the Metropolis-Hastings algorithm and the general family of MCMC methods [22].

We point out that Boltzmann distributions are not the only distributions which can be adopted as equilibrium distributions in simulated annealing [7]. In this paper it is convenient for us to adopt a different type of equilibrium distribution in place of Boltzmann distributions.

1.1 Our setting

The optimization criterion is U : Θ → [0, 1], with Θ ⊂ R^N. The assumption that U takes values in the interval [0, 1] is a technical one. It does not imply any serious loss of generality. In general, any bounded optimization criterion can be scaled to take values in [0, 1]. We assume that the optimization task is to find a global maximizer; this can be done without loss of generality. We also assume that Θ is a bounded set.

We consider equilibrium distributions defined by probability density functions proportional to [U(θ) + δ]^J, where J and δ are two strictly positive parameters. We use π^(J) to denote an equilibrium distribution, i.e. π^(J)(dθ) ∝ [U(θ) + δ]^J π_Leb(dθ), where π_Leb is the standard Lebesgue measure. Here J^−1 plays the role of the temperature: if the function U(θ) plus δ is taken to a positive power J, then as J increases (i.e. as J^−1 decreases) [U(θ) + δ]^J becomes increasingly peaked around the global maximizers. The parameter δ is an offset which guarantees that the equilibrium densities are always strictly positive, even if U takes zero values on some elements of the domain. The offset δ is chosen by the user and we show later that our results allow one to make an optimal selection of δ.
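Concretely, a chain with equilibrium density proportional to [U(θ) + δ]^J can be simulated with a plain Metropolis rule: a symmetric proposal reduces the Hastings ratio to a ratio of unnormalized densities. The following is a minimal sketch; the criterion U, the offset, the proposal width and the cooling levels are all assumed for illustration, not values prescribed by the paper.

```python
import numpy as np

def mh_step(theta, U, J, delta, step, lo, hi, rng):
    """One Metropolis step targeting the density proportional to
    [U(theta) + delta]**J on the box [lo, hi]. The Gaussian proposal is
    symmetric, so the acceptance probability is just the density ratio."""
    prop = theta + step * rng.standard_normal(theta.shape)
    if np.any(prop < lo) or np.any(prop > hi):
        return theta  # proposals outside the bounded domain are rejected
    ratio = ((U(prop) + delta) / (U(theta) + delta)) ** J
    return prop if rng.random() < ratio else theta

# Toy run on an assumed 1-D criterion with its maximizer at 0.3.
U = lambda th: float(np.exp(-50.0 * (th[0] - 0.3) ** 2))
rng = np.random.default_rng(0)
theta, trace = np.array([0.9]), []
for J in (1, 4, 16, 64):  # piecewise-constant "cooling schedule"
    for _ in range(3000):
        theta = mh_step(theta, U, J, delta=0.1, step=0.1, lo=0.0, hi=1.0, rng=rng)
        trace.append(theta[0])
```

Raising J in stages, as in the loop above, is the cooling of the inhomogeneous chain; each stage on its own is a homogeneous chain with equilibrium π^(J).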
The zero-temperature distribution is the limiting distribution, for J → ∞, which takes non-zero values only on the set of global maximizers. It is denoted by π^(∞).

In the generic formulation of the method, the Markov transition kernel of the k-th step of the inhomogeneous chain has equilibrium distribution π^(J_k), where {J_k}_{k=1,2,...} is the "cooling schedule". The cooling schedule is a non-decreasing sequence of positive numbers according to which the equilibrium distributions become increasingly sharpened during the evolution of the chain. We use θ_k to denote the state of the chain and P_θk to denote its probability distribution. The distribution P_θk obviously depends on the initial condition θ_0. However, in this work, we do not need to make this dependence explicit in the notation.

Remark 1: If, given an element θ in Θ, the value U(θ) can be computed directly, we say that U is a deterministic criterion, e.g. the energy landscape in protein structure prediction [1]. In problems involving random variables, the value U(θ) may be the expected value

    U(θ) = ∫ g(x, θ) p_x(x; θ) dx

of some function g which depends on both the optimization variable θ and on some random variable x which has probability density p_x(x; θ) (which may itself depend on θ). In such problems it is usually not possible to compute U(θ) directly, either because evaluation of the integral requires too much computation, or because no analytical expression for p_x(x; θ) is available. Typically one must perform stochastic simulations in order to obtain samples of x for a given θ, hence obtain sample values of g(x, θ), and thus construct a Monte Carlo estimate of U(θ). The Bayesian design of clinical trials is an important application area where such expected-value criteria arise [23].
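A minimal sketch of such a Monte Carlo estimate of an expected-value criterion follows. The model g, the sampling distribution p_x and the sample size are toy assumptions; the particular choice x ~ N(θ, 1) with g(x, θ) = exp(−x²) is picked because then U(0) = 1/√3 exactly, so the estimate is easy to check.

```python
import numpy as np

def U_hat(theta, g, sample_x, n, rng):
    """Monte Carlo estimate of U(theta) = E[g(x, theta)] with x ~ p_x(.; theta).
    g and sample_x encode the (problem-specific, assumed) model."""
    xs = sample_x(theta, n, rng)
    return float(np.mean(g(xs, theta)))

# Toy model: x ~ N(theta, 1), g(x, theta) = exp(-x**2); U(0) = 1/sqrt(3).
rng = np.random.default_rng(1)
sample_x = lambda theta, n, rng: rng.normal(theta, 1.0, size=n)
g = lambda x, theta: np.exp(-x ** 2)
est = U_hat(0.0, g, sample_x, n=20000, rng=rng)
```

The standard error of the estimate shrinks as 1/√n, which is what makes such criteria expensive to evaluate inside a search loop.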
The authors of this paper investigate the optimization of expected-value criteria motivated by problems of aircraft routing [24]. In the particular case that p_x(x; θ) does not depend on θ, the optimization task is often called "empirical risk minimization", and is studied extensively in statistical learning theory [16, 17]. The results of this paper apply in the same way to the optimization of both deterministic and expected-value criteria. The MCMC method developed by Müller [25, 26] allows one to construct simulated annealing algorithms for the optimization of expected-value criteria. Müller [25, 26] employs the same equilibrium distributions as those described in our setting; in his context J is restricted to integer values.

2 Convergence

The rationale of simulated annealing is as follows: if the temperature is kept constant, say J_k = J, then the distribution of the state of the chain P_θk tends to the equilibrium distribution π^(J); if J → ∞ then the equilibrium distribution π^(J) tends to the zero-temperature distribution π^(∞); as a result, if the cooling schedule J_k tends to infinity, one obtains that P_θk "follows" π^(J_k), that π^(J_k) tends to π^(∞), and eventually that the distribution of the state of the chain P_θk tends to π^(∞). The theory shows that, under conditions on the cooling schedule and the Markov transition kernels, the distribution of the state of the chain P_θk actually converges to the target zero-temperature distribution π^(∞) as k → ∞ [12, 13, 14, 15].
Convergence to the zero-temperature distribution implies that asymptotically the state of the chain eventually coincides with a global optimizer with probability one.

The difficulty which must be overcome in order to obtain finite-step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers has zero Lebesgue measure (e.g. a set of isolated points). If the set of global optimizers has zero measure then the set of global optimizers has null probability according to the equilibrium distributions π^(J) for any finite J and, as a consequence, according to the distributions P_θk for any finite k. Put another way, the probability that the state of the chain visits the set of global optimizers is constantly zero after any finite number of steps. Hence the confidence of the fact that the solution provided by the algorithm in finite time coincides with a global optimizer is also constantly zero. Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [7, 8, 9, 10].

It is instructive to look at the issue also in terms of the rate of convergence to the target zero-temperature distribution. On a discrete domain, the distribution of the state of the chain at each step and the zero-temperature distribution are both standard discrete distributions. It is then possible to define a distance between them and study the rate of convergence of this distance to zero. This analysis allows one to obtain results on the finite-time behavior of simulated annealing [7, 8]. On a continuous domain and for a set of global optimizers of measure zero, the target zero-temperature distribution π^(∞) ends up being a mixture of probability masses on the set of global optimizers.
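The contrast can be made concrete numerically: for every finite J the measure π^(J) has a density, so it assigns probability zero to a zero-measure set of maximizers, and yet it places its mass arbitrarily close to them as J grows. A small sketch on an assumed toy 1-D criterion, using plain grid sums in place of exact integration:

```python
import numpy as np

# Mass that pi^(J) (density proportional to [U + delta]^J) puts within
# distance r of the unique maximizer of a toy criterion on [0, 1].
U = lambda th: 1.0 - np.abs(th - 0.5)  # peak at 0.5, values in [0.5, 1]
delta = 0.2
grid = np.linspace(0.0, 1.0, 20001)

def mass_near(J, r):
    w = (U(grid) + delta) ** J  # unnormalized pi^(J) on the grid
    return float(w[np.abs(grid - 0.5) <= r].sum() / w.sum())

masses = [mass_near(J, r=0.05) for J in (1, 10, 100, 1000)]
```

The singleton {0.5} keeps probability zero under every π^(J), while the mass within any fixed radius of it tends to 1 as J grows; only the limit π^(∞) sits on the maximizer itself.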
In this situation, although the distribution of the state of the chain P_θk still converges asymptotically to π^(∞), it is not possible to introduce a sensible distance between the two distributions and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [12, Theorem 3.3]. This is the reason that until now there have been no guarantees on the performance of simulated annealing on a continuous domain after a finite number of computations: by adopting the zero-temperature distribution π^(∞) as the target distribution it is only possible to prove asymptotic convergence in infinite time to a global optimizer.

Remark 2: The standard distance between two distributions, say μ1 and μ2, on a continuous support is the total variation norm ||μ1 − μ2||_TV = sup_A |μ1(A) − μ2(A)|, see e.g. [21]. In simulated annealing on a continuous domain the distribution of the state of the chain P_θk is absolutely continuous with respect to the Lebesgue measure (i.e. π_Leb(A) = 0 ⇒ P_θk(A) = 0), by construction, for any finite k. Hence if the set of global optimizers has zero Lebesgue measure then it has zero measure also according to P_θk. The set of global optimizers has however measure 1 according to π^(∞). The distance ||P_θk − π^(∞)||_TV is then constantly 1 for any finite k.

It is also worth mentioning that if the set of global optimizers has zero measure then asymptotic convergence to the zero-temperature distribution π^(∞) can be proven only under the additional assumptions of continuity and differentiability of U [12, 13, 14, 15].

3 Finite-time guarantees

In general, optimization algorithms for problems defined on continuous variables can only find approximate solutions in finite time [27].
Given an element θ of a continuous domain, how can we assess how good it is as an approximate solution to an optimization problem? Here we introduce the concept of approximate global optimizer to answer this question. The definition is given for a maximization problem in a continuous but bounded domain. We use two parameters: the value imprecision ϵ (greater than or equal to 0) and the residual domain α (between 0 and 1), which together determine the level of approximation. We say that θ is an approximate global optimizer of U with value imprecision ϵ and residual domain α if the function U takes values strictly greater than U(θ) + ϵ only on a subset of values of θ no larger than an α portion of the optimization domain. The formal definition is as follows.

Definition 1 Let U : Θ → R be an optimization criterion where Θ ⊂ R^N is bounded. Let π_Leb denote the standard Lebesgue measure. Let ϵ ≥ 0 and α ∈ [0, 1] be given numbers. Then θ is an approximate global optimizer of U with value imprecision ϵ and residual domain α if

    π_Leb{θ′ ∈ Θ : U(θ′) > U(θ) + ϵ} ≤ α π_Leb(Θ) .

In other words, the value U(θ) is within ϵ of a value which is greater than the values that U takes on at least a 1 − α portion of the domain. The smaller ϵ and α are, the better is the approximation of a true global optimizer. If both α and ϵ are equal to zero then U(θ) coincides with the essential supremum of U.

Our definition of approximate global optimizer carries an important property, which holds regardless of what the criterion U is: if ϵ and α have non-zero values then the set of approximate global optimizers always has non-zero Lebesgue measure.
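Definition 1 can be probed by uniform sampling of the domain: the fraction of samples on which U exceeds U(θ) + ϵ estimates the Lebesgue fraction in the definition. A minimal sketch, in which the criterion, the box domain and the sample size are assumed, and the outcome is a statistical estimate of the condition, not a certificate:

```python
import numpy as np

def is_approx_optimizer(theta, U, eps, alpha, lo, hi, n, rng):
    """Monte Carlo check of Definition 1: estimate the Lebesgue fraction of
    the box [lo, hi] on which U exceeds U(theta) + eps, compare to alpha."""
    dim = len(lo)
    samples = rng.uniform(lo, hi, size=(n, dim))
    frac = np.mean([U(s) > U(theta) + eps for s in samples])
    return frac <= alpha

rng = np.random.default_rng(2)
U = lambda th: float(np.exp(-np.sum((th - 0.5) ** 2)))  # peak at (0.5, 0.5)
lo, hi = np.zeros(2), np.ones(2)
good = is_approx_optimizer(np.array([0.45, 0.55]), U, eps=0.05, alpha=0.1,
                           lo=lo, hi=hi, n=5000, rng=rng)
bad = is_approx_optimizer(np.array([0.0, 0.0]), U, eps=0.05, alpha=0.1,
                          lo=lo, hi=hi, n=5000, rng=rng)
```

With n samples the estimated fraction has standard error about sqrt(α(1 − α)/n), which indicates how n should be chosen relative to α.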
It follows that the probability that the chain visits the set of approximate global optimizers can be non-zero. Hence, it is sensible to study the confidence of the fact that the solution found by simulated annealing in finite time is an approximate global optimizer.

Remark 3: The intuition that our notion of approximate global optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar [17, 28]. Vidyasagar [17, 28] introduces a similar definition and obtains rigorous finite-time guarantees in the optimization of expected-value criteria based on uniform independent sampling of the domain. Notably, the number of independent samples required to guarantee some desired accuracy and confidence turns out to be polynomial in the values of the desired imprecision, residual domain and confidence. Although the method of Vidyasagar is not highly sophisticated, it has had considerable success in solving difficult control system design applications [28, 29]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion.

Here we show that finite-time guarantees for simulated annealing can be obtained by selecting a distribution π^(J) with a finite J as the target distribution, in place of the zero-temperature distribution π^(∞). The fundamental result is the following theorem, which allows one to select δ and J in the target distribution π^(J) in a rigorous way. It is important to stress that the result holds universally for any optimization criterion U on a bounded domain. The only minor requirement is that U takes values in [0, 1].

Theorem 1 Let U : Θ → [0, 1] be an optimization criterion where Θ ⊂ R^N is bounded.
Let J ≥ 1 and δ > 0 be given numbers. Let θ be a multivariate random variable with distribution π^(J)(dθ) ∝ [U(θ) + δ]^J π_Leb(dθ). Let α ∈ (0, 1] and ϵ ∈ [0, 1] be given numbers and define

    σ = 1 / ( 1 + [ (1 + δ)/(ϵ + 1 + δ) ]^J · [ (1/α) · (1 + δ)/(ϵ + δ) − 1 ] · (1 + δ)/δ ) .        (1)

Then the statement "θ is an approximate global optimizer of U with value imprecision ϵ and residual domain α" holds with probability at least σ.

Proof. See Appendix A.

The importance of the choice of a target distribution π^(J) with a finite J is that π^(J) is absolutely continuous with respect to the Lebesgue measure. Hence, the distance ||P_θk − π^(J)||_TV between the distribution of the state of the chain P_θk and the target distribution π^(J) is a meaningful quantity.

Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation norm is a well studied problem. The theory provides simple conditions under which one derives upper bounds on the distance to the target distribution which are known at each step of the chain and decrease monotonically to zero as the number of steps of the chain grows. The theory has been developed mainly for homogeneous chains [18, 19, 20, 21].

In the case of simulated annealing, the factor that enables us to employ these results is the absolute continuity of the target distribution π^(J) with respect to the Lebesgue measure. However, simulated annealing involves the simulation of inhomogeneous chains. In this respect, another important fact is that the choice of a target distribution π^(J) with a finite J implies that the inhomogeneous Markov chain can in fact be formed by a finite sequence of homogeneous chains (i.e. the cooling schedule {J_k}_{k=1,2,...}
can be chosen to be a sequence that takes only a finite set of values). In turn, this allows one to apply the theory of homogeneous MCMC methods to study the convergence of P_θk to π^(J) in total variation norm.

On a bounded domain, simple conditions on the "proposal distribution" in the iteration of the simulated annealing algorithm allow one to obtain upper bounds on ||P_θk − π^(J)||_TV that decrease geometrically to zero as k → ∞, without the need for any additional assumption on U [18, 19, 20, 21]. It is then appropriate to introduce the following finite-time result.

Theorem 2 Let the notation and assumptions of Theorem 1 hold. Let θ_k, with distribution P_θk, be the state of the inhomogeneous chain of a simulated annealing algorithm with target distribution π^(J). Then the statement "θ_k is an approximate global optimizer of U with value imprecision ϵ and residual domain α" holds with probability at least σ − ||P_θk − π^(J)||_TV.

The proof of the theorem follows directly from the definition of the total variation norm.

It follows that if simulated annealing is implemented with an algorithm which converges in total variation distance to a target distribution π^(J) with a finite J, then one can state with confidence arbitrarily close to 1 that the solution found by the algorithm after the known appropriate finite number of steps is an approximate global optimizer with the desired approximation level. For given non-zero values of ϵ and α, the value of σ given by (1) can be made arbitrarily close to 1 by choice of J, while the distance ||P_θk − π^(J)||_TV can be made arbitrarily small by taking the known sufficient number of steps.

It can be shown that there exists the possibility of making an optimal choice of δ and J in the target distribution π^(J).
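Equation (1) is explicit enough to support this selection numerically: for fixed ϵ and α one can grid-search δ to maximize σ at each J, and then take the smallest J whose optimized σ reaches a desired confidence. In the sketch below, the accuracy targets ϵ = α = 0.05, the confidence level 0.99 and the δ-grid are assumed for illustration:

```python
import numpy as np

def sigma(J, delta, eps, alpha):
    """Confidence bound of Theorem 1, equation (1)."""
    a = ((1.0 + delta) / (eps + 1.0 + delta)) ** J
    b = (1.0 / alpha) * (1.0 + delta) / (eps + delta) - 1.0
    return 1.0 / (1.0 + a * b * (1.0 + delta) / delta)

eps, alpha = 0.05, 0.05
deltas = np.linspace(1e-3, 1.0, 1000)  # assumed search grid for the offset

def best_sigma(J):
    # sigma is vectorized over the delta grid; keep the best offset's value.
    return float(sigma(J, deltas, eps, alpha).max())

J = 1
while best_sigma(J) < 0.99:  # smallest J reaching the target confidence
    J += 1
```

Here the maximizing δ lies strictly inside the grid, consistent with the existence of an optimal offset; keeping J as small as the required σ allows shortens the subsequent total-variation convergence phase.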
In fact, for given ϵ and α and a given value of J, there exists an optimal choice of δ which maximizes the value of σ given by (1). Hence, it is possible to obtain a desired σ with the smallest possible J. The advantage of choosing the smallest J consistent with the required approximation and confidence is that it will decrease the number of steps required to achieve the desired reduction of ||P_θk − π^(J)||_TV.

4 Conclusions

We have introduced a new formulation of simulated annealing which admits rigorous finite-time guarantees in the optimization of functions of continuous variables. First, we have introduced the notion of approximate global optimizer. Then, we have shown that simulated annealing is guaranteed to find approximate global optimizers, with the desired confidence and the desired level of accuracy, in a known finite number of steps, if a proper choice of the target distribution is made and conditions for convergence in total variation norm are met. The results hold for any optimization criterion on a bounded domain with the only minor requirement that it takes values between 0 and 1.

In this framework, simulated annealing algorithms with rigorous finite-time guarantees can be derived by studying the choice of the proposal distribution and of the cooling schedule, in the generic iteration of simulated annealing, in order to ensure convergence to the target distribution in total variation norm.
To do this, existing theory of convergence of the Metropolis-Hastings algorithm and MCMC methods on continuous domains can be used [18, 19, 20, 21].

Vidyasagar [17, 28] has introduced a similar definition of approximate global optimizer and has shown that approximate optimizers with desired accuracy and confidence can be obtained with a number of uniform independent samples of the domain which is polynomial in the accuracy and confidence parameters. In general, algorithms developed with the MCMC methodology can be expected to be equally or more efficient than uniform independent sampling.

Acknowledgments

Work supported by EPSRC, Grant EP/C014006/1, and by the European Commission under projects HYGEIA FP6-NEST-4995 and iFly FP6-TREN-037180. We thank S. Brooks, M. Vidyasagar and D. M. Wolpert for discussions and useful comments on the paper.

A Proof of Theorem 1

Let ᾱ ∈ (0, 1] and ρ ∈ (0, 1] be given numbers. Let U_δ(θ) := U(θ) + δ. Let π_δ be a normalized measure such that π_δ(dθ) ∝ U_δ(θ) π_Leb(dθ). In the first part of the proof we find a lower bound on the probability that θ belongs to the set

    {θ ∈ Θ : π_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ} .

Let y_ᾱ := inf{y : π_δ{θ ∈ Θ : U_δ(θ) ≤ y} ≥ 1 − ᾱ}. To start with, we show that the set {θ ∈ Θ : π_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ} coincides with {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}. Notice that the quantity π_δ{θ ∈ Θ : U_δ(θ) ≤ y} is a right-continuous non-decreasing function of y because it has the form of a distribution function (see e.g.
[30, p.162] and [17, Lemma 11.1]). Therefore we have π_δ{θ ∈ Θ : U_δ(θ) ≤ y_ᾱ} ≥ 1 − ᾱ and

    y ≥ ρ y_ᾱ ⇒ π_δ{θ′ ∈ Θ : ρ U_δ(θ′) ≤ y} ≥ 1 − ᾱ ⇒ π_δ{θ′ ∈ Θ : ρ U_δ(θ′) > y} ≤ ᾱ .

Moreover,

    y < ρ y_ᾱ ⇒ π_δ{θ′ ∈ Θ : ρ U_δ(θ′) ≤ y} < 1 − ᾱ ⇒ π_δ{θ′ ∈ Θ : ρ U_δ(θ′) > y} > ᾱ

and taking the contrapositive one obtains

    π_δ{θ′ ∈ Θ : ρ U_δ(θ′) > y} ≤ ᾱ ⇒ y ≥ ρ y_ᾱ .

Therefore {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ} ≡ {θ ∈ Θ : π_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ}.

We now derive a lower bound on π^(J){θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}. Let us introduce the notation A_ᾱ := {θ ∈ Θ : U_δ(θ) < y_ᾱ}, Ā_ᾱ := {θ ∈ Θ : U_δ(θ) ≥ y_ᾱ}, B_ᾱ,ρ := {θ ∈ Θ : U_δ(θ) < ρ y_ᾱ} and B̄_ᾱ,ρ := {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}. Notice that B_ᾱ,ρ ⊆ A_ᾱ and Ā_ᾱ ⊆ B̄_ᾱ,ρ. The quantity π_δ{θ ∈ Θ : U_δ(θ) < y}, as a function of y, is the left-continuous version of π_δ{θ ∈ Θ : U_δ(θ) ≤ y} [30, p.162].
Hence, the definition of y_ᾱ implies π_δ(A_ᾱ) ≤ 1 − ᾱ and π_δ(Ā_ᾱ) ≥ ᾱ. Notice that, since δ ≤ U_δ ≤ 1 + δ,

    π_δ(A_ᾱ) ≤ 1 − ᾱ ⇒ δ π_Leb(A_ᾱ) / ∫_Θ U_δ(θ) π_Leb(dθ) ≤ 1 − ᾱ ,

    π_δ(Ā_ᾱ) ≥ ᾱ ⇒ (1 + δ) π_Leb(Ā_ᾱ) / ∫_Θ U_δ(θ) π_Leb(dθ) ≥ ᾱ .

Hence, π_Leb(Ā_ᾱ) > 0 and

    π_Leb(A_ᾱ) / π_Leb(Ā_ᾱ) ≤ [(1 − ᾱ)/ᾱ] · [(1 + δ)/δ] .

Notice that π_Leb(Ā_ᾱ) > 0 implies π_Leb(B̄_ᾱ,ρ) > 0. We obtain

    π^(J){θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ}
      = 1 / ( 1 + ∫_{B_ᾱ,ρ} U_δ(θ)^J π_Leb(dθ) / ∫_{B̄_ᾱ,ρ} U_δ(θ)^J π_Leb(dθ) )
      ≥ 1 / ( 1 + ∫_{B_ᾱ,ρ} U_δ(θ)^J π_Leb(dθ) / ∫_{Ā_ᾱ} U_δ(θ)^J π_Leb(dθ) )
      ≥ 1 / ( 1 + ρ^J y_ᾱ^J π_Leb(B_ᾱ,ρ) / (y_ᾱ^J π_Leb(Ā_ᾱ)) )
      ≥ 1 / ( 1 + ρ^J π_Leb(A_ᾱ) / π_Leb(Ā_ᾱ) )
      ≥ 1 / ( 1 + ρ^J [(1 − ᾱ)/ᾱ] [(1 + δ)/δ] ) .

Since {θ ∈ Θ : U_δ(θ) ≥ ρ y_ᾱ} ≡ {θ ∈ Θ : π_δ{θ′ ∈ Θ : ρ U_δ(θ′) > U_δ(θ)} ≤ ᾱ}, the first part of the proof is complete.

In the second part of the proof we show that the set {θ ∈ Θ : π_δ{θ′ ∈ Θ : ρ
U\u03b4(\u03b8\u2032) >\nU\u03b4(\u03b8)} \u2264 \u00af\u03b1} is contained in the set of approximate global optimizers of U with value imprecision\n\u02dc\u01eb := (\u03c1\u22121 \u2212 1)(1 + \u03b4) and residual domain \u02dc\u03b1 := 1+\u03b4\n\u02dc\u01eb+\u03b4 \u00af\u03b1. Hence, we show that {\u03b8 \u2208 \u0398 : \u03c0\u03b4{\u03b8\u2032 \u2208\n\u0398 : \u03c1 U\u03b4(\u03b8\u2032) > U\u03b4(\u03b8)} \u2264 \u00af\u03b1} \u2286 {\u03b8 \u2208 \u0398 : \u03c0Leb{\u03b8\u2032 \u2208 \u0398 : U (\u03b8\u2032) > U (\u03b8) + \u02dc\u01eb} \u2264 \u02dc\u03b1 \u03c0Leb(\u0398)}. We\nhave\n\nU (\u03b8\u2032) > U (\u03b8) + \u02dc\u01eb \u21d4 \u03c1 U\u03b4(\u03b8\u2032) > \u03c1 [U\u03b4(\u03b8) + \u02dc\u01eb] \u21d2 \u03c1 U\u03b4(\u03b8\u2032) > U\u03b4(\u03b8)\n\nwhich is proven by noticing that \u03c1 [U\u03b4(\u03b8) + \u02dc\u01eb] \u2265 U\u03b4(\u03b8) \u21d4 1 \u2212 \u03c1 \u2265 U (\u03b8)(1 \u2212 \u03c1)\nand U (\u03b8) \u2208 [0, 1]. Hence {\u03b8\u2032 \u2208 \u0398 : \u03c1 U\u03b4(\u03b8\u2032) > U\u03b4(\u03b8)} \u2287 {\u03b8\u2032 \u2208 \u0398 : U (\u03b8\u2032) > U (\u03b8) + \u02dc\u01eb} .\nTherefore \u03c0\u03b4{\u03b8\u2032 \u2208 \u0398 : \u03c1 U\u03b4(\u03b8\u2032) > U\u03b4(\u03b8)} \u2264 \u00af\u03b1 \u21d2 \u03c0\u03b4{\u03b8\u2032 \u2208 \u0398 : U (\u03b8\u2032) > U (\u03b8) + \u02dc\u01eb} \u2264 \u00af\u03b1 . 
Let $Q_{\theta,\tilde\epsilon} := \{\theta' \in \Theta : U(\theta') > U(\theta) + \tilde\epsilon\}$ and notice that

$$ \pi_\delta\{\theta' \in \Theta : U(\theta') > U(\theta) + \tilde\epsilon\} = \frac{\int_{Q_{\theta,\tilde\epsilon}} U(\theta')\, \pi_{\mathrm{Leb}}(d\theta') + \delta\, \pi_{\mathrm{Leb}}(Q_{\theta,\tilde\epsilon})}{\int_\Theta U(\theta')\, \pi_{\mathrm{Leb}}(d\theta') + \delta\, \pi_{\mathrm{Leb}}(\Theta)}\,. $$

We obtain

$$ \pi_\delta\{\theta' \in \Theta : U(\theta') > U(\theta) + \tilde\epsilon\} \leq \bar\alpha \;\Rightarrow\; \tilde\epsilon\, \pi_{\mathrm{Leb}}(Q_{\theta,\tilde\epsilon}) + \delta\, \pi_{\mathrm{Leb}}(Q_{\theta,\tilde\epsilon}) \leq \bar\alpha (1+\delta)\, \pi_{\mathrm{Leb}}(\Theta) \;\Rightarrow\; \pi_{\mathrm{Leb}}\{\theta' \in \Theta : U(\theta') > U(\theta) + \tilde\epsilon\} \leq \tilde\alpha\, \pi_{\mathrm{Leb}}(\Theta)\,. $$

Hence we can conclude that

$$ \pi_\delta\{\theta' \in \Theta : \rho\, U_\delta(\theta') > U_\delta(\theta)\} \leq \bar\alpha \;\Rightarrow\; \pi_{\mathrm{Leb}}\{\theta' \in \Theta : U(\theta') > U(\theta) + \tilde\epsilon\} \leq \tilde\alpha\, \pi_{\mathrm{Leb}}(\Theta) $$

and the second part of the proof is complete.

We have shown that given $\bar\alpha \in (0,1]$, $\rho \in (0,1]$, $\tilde\epsilon := (\rho^{-1} - 1)(1+\delta)$, $\tilde\alpha := \frac{1+\delta}{\tilde\epsilon+\delta}\, \bar\alpha$ and

$$ \sigma := \frac{1}{1 + \rho^J\, \dfrac{1-\bar\alpha}{\bar\alpha}\, \dfrac{1+\delta}{\delta}} = \frac{1}{1 + \left[\dfrac{1+\delta}{\tilde\epsilon + 1 + \delta}\right]^J \left[\dfrac{1}{\tilde\alpha}\, \dfrac{1+\delta}{\tilde\epsilon+\delta} - 1\right] \dfrac{1+\delta}{\delta}}\,, $$

the statement "$\theta$ is an approximate global optimizer of $U$ with value imprecision $\tilde\epsilon$ and residual domain $\tilde\alpha$" holds with probability at least $\sigma$. Notice that $\tilde\epsilon \in [0,1]$ and $\tilde\alpha \in (0,1]$ are linked through a bijective relation to $\rho \in [\frac{1+\delta}{2+\delta}, 1]$ and $\bar\alpha \in (0, \frac{\tilde\epsilon+\delta}{1+\delta}]$. The statement of the theorem is eventually obtained by expressing $\sigma$ as a function of the desired values $\tilde\epsilon = \epsilon$ and $\tilde\alpha = \alpha$. $\square$

References

[1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003.
[2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759-764, 2005.
[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671-680, 1983.
[4] E. Bonomi and J. Lutton. The N-city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551-568, 1984.
[5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605-1620, 1986.
[6] M. Mézard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability Problems. Science, 297:812-815, 2002.
[7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987.
[8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747-771, 1986.
[9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311-329, 1988.
[10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199-216, 2006.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866-893, 1991.
[13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419-436, 1991.
[14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395-406, 1996.
[15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121-133, 2000.
[16] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, US, 1995.
[17] M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer-Verlag, London, second edition, 2003.
[18] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[19] J. S. Rosenthal. Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. Am. Stat. Assoc., 90(430):558-566, 1995.
[20] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat., 24(1):101-121, 1996.
[21] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Prob. Surv., 1:20-71, 2004.
[22] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition, 2004.
[23] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. John Wiley & Sons, Chichester, UK, 2004.
[24] A. Lecchini-Visintini, W. Glover, J. Lygeros, and J. M. Maciejowski. Monte Carlo Optimization for Conflict Resolution in Air Traffic Control. IEEE Trans. Intell. Transp. Syst., 7(4):470-482, 2006.
[25] P. Müller. Simulation based optimal design. In J. O. Berger, J. M. Bernardo, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: Proceedings of the Sixth Valencia International Meeting, pages 459-474. Oxford: Clarendon Press, 1999.
[26] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian Design by Inhomogeneous Markov Chain Simulation. J. Am. Stat. Assoc., 99(467):788-798, 2004.
[27] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer-Verlag, New York, 1998.
[28] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515-1528, 2001.
[29] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer-Verlag, London, 2005.
[30] B. V. Gnedenko. Theory of Probability. Chelsea, New York, fourth edition, 1968.