{"title": "Max-value Entropy Search for Multi-Objective Bayesian Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 7825, "page_last": 7835, "abstract": "We consider the problem of multi-objective (MO) blackbox optimization using expensive function evaluations, where the goal is to approximate the true Pareto set of solutions by minimizing the number of function evaluations. For example, in hardware design optimization, we need to find the designs that trade off performance, energy, and area overhead using expensive simulations. We propose a novel approach referred to as Max-value Entropy Search for Multi-objective Optimization (MESMO) to solve this problem. MESMO employs an output-space entropy based acquisition function to efficiently select the sequence of inputs for evaluation to quickly uncover high-quality solutions.\n We also provide theoretical analysis to characterize the efficacy of MESMO. Our experiments on several synthetic and real-world benchmark problems show that MESMO consistently outperforms state-of-the-art algorithms.", "full_text": "Max-value Entropy Search for Multi-Objective Bayesian Optimization\n\nSyrine Belakaria, Aryan Deshwal, Janardhan Rao Doppa\n\nSchool of EECS, Washington State University\n\n{syrine.belakaria, aryan.deshwal, jana.doppa}@wsu.edu\n\nAbstract\n\nWe consider the problem of multi-objective (MO) blackbox optimization using expensive function evaluations, where the goal is to approximate the true Pareto set of solutions by minimizing the number of function evaluations. For example, in hardware design optimization, we need to find the designs that trade off performance, energy, and area overhead using expensive computational simulations. In this paper, we propose a novel approach referred to as Max-value Entropy Search for Multi-objective Optimization (MESMO) to solve this problem.
MESMO employs an output-space entropy based acquisition function to efficiently select the sequence of inputs for evaluation so as to quickly uncover high-quality Pareto-set solutions. We also provide theoretical analysis to characterize the efficacy of MESMO. Our experiments on several synthetic and real-world benchmark problems show that MESMO consistently outperforms state-of-the-art algorithms.\n\n1 Introduction\n\nMany engineering and scientific applications involve making design choices to optimize multiple objectives. Some examples include tuning the knobs of a compiler to optimize the performance and efficiency of a set of software programs; and designing new materials to optimize strength, elasticity, and durability. There are two common challenges in solving this kind of optimization problem: 1) The objective functions are unknown and we need to perform expensive experiments to evaluate each candidate design choice, e.g., computational simulations for the compiler optimization application and physical lab experiments for the material design application. 2) The objectives are conflicting in nature and cannot all be optimized simultaneously. Therefore, we need to find the Pareto optimal set of solutions. A solution is called Pareto optimal if it cannot be improved in any of the objectives without compromising some other objective. The overall goal is to approximate the optimal Pareto set by minimizing the number of function evaluations.\nBayesian Optimization (BO) [22] is an effective framework to solve blackbox optimization problems with expensive function evaluations. The key idea behind BO is to build a cheap surrogate model (e.g., a Gaussian Process [28]) using the real experimental evaluations, and to employ it to intelligently select the sequence of function evaluations using an acquisition function, e.g., expected improvement (EI).
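To make the acquisition step concrete, the expected improvement criterion mentioned above can be sketched as follows. This is a minimal illustrative sketch for minimization, not code from the paper; the function names are ours.

```python
import math

def normal_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best_y):
    """EI for minimization: the expected reduction below the incumbent
    best_y, given the GP posterior mean mu and std. deviation sigma at x."""
    if sigma <= 0.0:
        return max(best_y - mu, 0.0)
    z = (best_y - mu) / sigma
    return (best_y - mu) * normal_cdf(z) + sigma * normal_pdf(z)
```

In a BO loop this score would be maximized over x to pick the next evaluation, trading off a low posterior mean (exploitation) against high posterior uncertainty (exploration).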
There is a large body of literature on single-objective BO algorithms [22] and their applications, including hyper-parameter tuning of machine learning methods [24, 12]. However, there is relatively less work on the more challenging problem of BO for multiple objective functions [7], as discussed in the related work section.\nPrior work on multi-objective BO is lacking in the following ways. Many algorithms reduce the problem to single-objective optimization by designing appropriate acquisition functions, e.g., expected improvement in Pareto hypervolume [11, 5]. Unfortunately, this choice is sub-optimal as it can potentially lead to aggressive exploitation behavior. Additionally, algorithms to optimize Pareto Hypervolume (PHV) based acquisition functions scale poorly as the number of objectives and the dimensionality of the input space grow.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAnother method relies on an input space entropy based acquisition function [7] to select the candidate inputs for evaluation. However, it is computationally expensive to approximate and optimize this acquisition function.\nIn this paper, we propose a novel and principled approach referred to as Max-value Entropy Search for Multi-objective Optimization (MESMO) to overcome the drawbacks of prior work. MESMO employs an output space entropy based acquisition function to select the candidate inputs for evaluation. The key idea is to evaluate, in each iteration, the input that maximizes the information gain about the optimal Pareto front. Output space entropy search has many advantages over algorithms based on input space entropy search: a) it allows a much tighter approximation; b) it is significantly cheaper to compute; and c) it naturally lends itself to robust optimization. Indeed, our experiments demonstrate these advantages of MESMO.
Our work is inspired by the recent success of single-objective BO algorithms based on the idea of optimizing output-space information gain [26, 9], which are shown to be among the most efficient and robust in a family of information-theoretic acquisition functions [6, 8]. Specifically, we extend the max-value entropy search approach [26] to the challenging multi-objective setting.\nContributions. The main contributions of this paper are:\n\n• Developing a principled approach referred to as MESMO to solve multi-objective blackbox optimization problems. MESMO employs an output space entropy based acquisition function to efficiently select the sequence of candidate inputs for evaluation.\n• Theoretical analysis of the MESMO algorithm in terms of asymptotic regret bounds.\n• Comprehensive experiments over diverse synthetic and real-world benchmark problems to show accuracy and efficiency improvements over existing methods.\n\n2 Background and Problem Setup\n\nBayesian Optimization (BO) Framework. BO is a very efficient framework to solve global optimization problems using black-box evaluations of expensive objective functions. Let X ⊆ ℝ^d be an input space. In the single-objective BO formulation, we are given an unknown real-valued objective function f : X → ℝ, which can evaluate each input x ∈ X to produce an evaluation y = f(x). Each evaluation f(x) is expensive in terms of the consumed resources. The main goal is to find an input x* ∈ X that approximately optimizes f by performing a limited number of function evaluations. BO algorithms learn a cheap surrogate model from training data obtained from past function evaluations. They intelligently select the next input for evaluation by trading off exploration and exploitation to quickly direct the search towards optimal inputs. The three key elements of the BO framework are:\n\n1) Statistical Model of the true function f(x).
A Gaussian Process (GP) [28] is the most commonly used model. A GP over a space X is a random process from X to ℝ. It is characterized by a mean function µ : X → ℝ and a covariance or kernel function κ : X × X → ℝ. If a function f is sampled from GP(µ, κ), then f(x) is distributed normally as N(µ(x), κ(x, x)) for any finite set of inputs x ∈ X.\n\n2) Acquisition Function (α) to score the utility of evaluating a candidate input x ∈ X based on the statistical model. Some popular acquisition functions in the single-objective literature include expected improvement (EI), upper confidence bound (UCB), predictive entropy search (PES) [8], and max-value entropy search (MES) [26].\n\n3) Optimization Procedure to select the best-scoring candidate input according to α based on the statistical model. DIRECT [10] is a very popular approach for acquisition function optimization.\n\nMulti-Objective Optimization (MOO) Problem. Without loss of generality, our goal is to minimize real-valued objective functions f1(x), f2(x),···, fK(x), with K ≥ 2, over a continuous space X ⊆ ℝ^d. Each evaluation of an input x ∈ X produces a vector of objective values y = (y1, y2,···, yK), where yi = fi(x) for all i ∈ {1, 2,···, K}. We say that a point x Pareto-dominates another point x′ if fi(x) ≤ fi(x′) ∀i and there exists some j ∈ {1, 2,···, K} such that fj(x) < fj(x′). The optimal solution of the MOO problem is a set of points X* ⊂ X such that no point x′ ∈ X\X* Pareto-dominates a point x ∈ X*. The solution set X* is called the optimal Pareto set and the corresponding set of function values Y* is called the optimal Pareto front.
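The Pareto-dominance definition above translates directly into code. A minimal sketch for the minimization convention used here (helper names are ours):

```python
def dominates(y, y_prime):
    """y Pareto-dominates y' (minimization): y is no worse in every
    objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(y, y_prime))
            and any(a < b for a, b in zip(y, y_prime)))

def pareto_front(points):
    """Return the non-dominated subset of a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

For example, among {(1, 2), (2, 1), (2, 2), (3, 3)} only (1, 2) and (2, 1) are non-dominated: each of the other two points is dominated by at least one of them.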
Our goal is to approximate X* by minimizing the number of function evaluations.\n\n3 Related work\n\nThere is a family of model-based multi-objective BO algorithms that reduce the problem to single-objective optimization. The ParEGO method [11] employs random scalarization for this purpose: scalar weights of the K objective functions are sampled from a uniform distribution to construct a single-objective function, and expected improvement is employed as the acquisition function to select the next input for evaluation. ParEGO is simple and fast, but more advanced approaches often outperform it. Many methods optimize the Pareto hypervolume (PHV) metric [5], which captures the quality of a candidate Pareto set. This is done by extending the standard acquisition functions to the PHV objective, e.g., expected improvement in PHV (EHI) [5] and probability of improvement in PHV (SUR) [17]. Unfortunately, algorithms to optimize PHV based acquisition functions scale very poorly and are not feasible for more than two objectives. SMSego is a relatively faster method [19]. To improve scalability, the gain in hypervolume is computed over a limited set of points: SMSego finds that set of points by optimizing the posterior means of the GPs. A common drawback of this family of algorithms is that the reduction to single-objective optimization can potentially lead to more exploitative behavior, resulting in sub-optimal solutions.\nPAL [31] and PESMO [7] are principled algorithms based on information theory. PAL tries to classify the input points, based on the learned models, into three categories: Pareto optimal, non-Pareto optimal, and uncertain. In each iteration, it selects the candidate input for evaluation with the goal of minimizing the size of the uncertain set. PAL provides theoretical guarantees, but it is applicable only to input spaces X consisting of a finite set of discrete points.
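For concreteness, the random scalarization step used by ParEGO above can be sketched as follows. This is a simplified illustration using an augmented Tchebycheff form; the simplex weight sampling and the rho value shown here are our assumptions for the sketch, not details taken from [11].

```python
import random

def random_weights(k, rng=None):
    """Sample k scalarization weights uniformly at random and normalize
    them onto the simplex (they sum to 1)."""
    rng = rng or random.Random(0)
    raw = [rng.random() for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

def scalarize(y, weights, rho=0.05):
    """Augmented Tchebycheff scalarization (minimization): collapse a
    K-vector of objective values y into a single scalar objective."""
    weighted = [w * v for w, v in zip(weights, y)]
    return max(weighted) + rho * sum(weighted)
```

A fresh weight vector is drawn at every BO iteration, so successive iterations target different trade-off regions of the Pareto front.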
PESMO [7] relies on an input space entropy based acquisition function and iteratively selects the input that maximizes the information gained about the optimal Pareto set X*. Unfortunately, optimizing this acquisition function poses significant challenges: a) it requires a series of approximations, which can be potentially sub-optimal; b) the optimization, even after approximations, is expensive; and c) the performance is strongly dependent on the number of Monte-Carlo samples. In comparison, our proposed output space entropy based acquisition function overcomes the above challenges and allows efficient and robust optimization. More specifically, the time complexities of the acquisition function computation in PESMO and MESMO (ignoring the time to solve the cheap MO problem, which is common to both algorithms) are O(SKm³) and O(SK) respectively, where S is the number of Monte-Carlo samples, K is the number of objectives, and m is the size of the sample Pareto set in PESMO. Additionally, as demonstrated in our experiments, MESMO is very robust and performs very well even with one sample.\n\n4 MESMO Algorithm for Multi-Objective Optimization\n\nIn this section, we explain the technical details of our proposed MESMO algorithm. We first mathematically describe the output space entropy based acquisition function and provide an algorithmic approach to efficiently compute it. Subsequently, we theoretically analyze MESMO in terms of asymptotic regret bounds.\nSurrogate models. Gaussian processes (GPs) have been shown to be effective surrogate models in prior work on single- and multi-objective BO [8, 27, 26, 25, 7]. Similar to prior work [7], we model the objective functions f1, f2,···, fK using K independent GP models M1, M2,···, MK with zero mean and i.i.d. observation noise.
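The K-independent-GPs surrogate just described can be sketched from scratch. This is an illustrative sketch assuming a unit-variance squared exponential kernel with fixed length scale; the helper names and the small jitter value are our assumptions, not the authors' implementation.

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    """Squared exponential kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_posterior(X, y, Xq, noise=1e-6):
    """Zero-mean GP posterior mean and variance at query points Xq,
    conditioned on observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    # Diagonal of the posterior covariance: k(x,x) - k_s K^{-1} k_s^T.
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.maximum(var, 0.0)

def fit_independent_gps(X, Y):
    """One zero-mean GP per objective column of Y. Returns a list of
    predict(Xq) -> (mean, variance) closures, one per objective."""
    return [lambda Xq, j=j: gp_posterior(X, Y[:, j], Xq)
            for j in range(Y.shape[1])]
```

Each objective gets its own posterior, so downstream computations (such as the per-objective predictive variances used in the acquisition function) factor across the K models.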
Let D = {(xi, yi)}_{i=1}^{t−1} be the training data from the past t−1 function evaluations, where xi ∈ X is an input and yi = {yi^1, yi^2,···, yi^K} is the output vector resulting from evaluating the functions f1, f2,···, fK at xi. We learn the surrogate models M1, M2,···, MK from D.\nOutput space entropy based acquisition function. Input space entropy based methods like PESMO [7] select the next candidate input xt (for ease of notation, we drop the subscript in the discussion below) by maximizing the information gain about the optimal Pareto set X*. The acquisition function based on input space entropy is given as follows:\n\nα(x) = I({x, y}, X* | D)   (4.1)\n= H(X* | D) − E_y[H(X* | D ∪ {x, y})]   (4.2)\n= H(y | D, x) − E_{X*}[H(y | D, x, X*)]   (4.3)\n\nInformation gain is defined as the expected reduction in the entropy H(.) of the posterior distribution P(X* | D) over the optimal Pareto set X*, as given in Equations 4.2 and 4.3 (the latter resulting from the symmetric property of information gain). This mathematical formulation relies on a very expensive and high-dimensional (m · d dimensions) distribution P(X* | D), where m is the size of the optimal Pareto set X*. Furthermore, optimizing the second term in the r.h.s. poses significant challenges: a) it requires a series of approximations [7], which can be potentially sub-optimal; b) the optimization, even after approximations, is expensive; and c) the performance is strongly dependent on the number of Monte-Carlo samples.\nTo overcome the above challenges of computing the input space entropy based acquisition function, we take an alternative route and propose to maximize the information gain about the optimal Pareto front Y*.
This is equivalent to the expected reduction in entropy over the Pareto front Y*, which relies on a computationally cheap and low-dimensional (m · K dimensions, which is significantly less than m · d since K ≪ d in practice) distribution P(Y* | D). Our acquisition function, which maximizes the information gain between the next candidate input for evaluation x and the Pareto front Y*, is given as:\n\nα(x) = I({x, y}, Y* | D)   (4.4)\n= H(Y* | D) − E_y[H(Y* | D ∪ {x, y})]   (4.5)\n= H(y | D, x) − E_{Y*}[H(y | D, x, Y*)]   (4.6)\n\nThe first term in the r.h.s. of equation 4.6 (the entropy of a factorizable K-dimensional Gaussian distribution P(y | D, x)) can be computed in closed form as shown below:\n\nH(y | D, x) = K(1 + ln(2π))/2 + Σ_{i=1}^{K} ln(σi(x))   (4.7)\n\nwhere σi²(x) is the predictive variance of the ith GP at input x. The second term in the r.h.s. of equation 4.6 is an expectation over the Pareto front Y*. We can approximately compute this term via Monte-Carlo sampling as shown below:\n\nE_{Y*}[H(y | D, x, Y*)] ≈ (1/S) Σ_{s=1}^{S} [H(y | D, x, Y*_s)]   (4.8)\n\nwhere S is the number of samples and Y*_s denotes a sample Pareto front. The main advantages of our acquisition function are computational efficiency and robustness to the number of samples. Our experiments demonstrate these advantages over the input space entropy based acquisition function.\nThere are two key algorithmic steps to compute Equation 4.8: 1) How do we compute the Pareto front samples Y*_s?; and 2) How do we compute the entropy with respect to a given Pareto front sample Y*_s? We provide solutions for these two questions below.\n\n1) Computing Pareto front samples via cheap multi-objective optimization.
To compute a Pareto front sample Y*_s, we first sample functions from the posterior GP models via random Fourier features [8, 20] and then solve a cheap multi-objective optimization over the K sampled functions.\nSampling functions from the posterior GP. Similar to prior work [8, 7, 26], we employ a random Fourier features based sampling procedure. We approximate each GP prior as f̃ = φ(x)ᵀθ, where θ ∼ N(0, I). The key idea behind random Fourier features is to construct each function sample f̃(x) as a finitely parametrized approximation φ(x)ᵀθ, where θ is sampled from its corresponding posterior distribution conditioned on the data D obtained from past function evaluations: θ|D ∼ N(A⁻¹Φᵀy_n, σ²A⁻¹), where A = ΦᵀΦ + σ²I and Φᵀ = [φ(x1),···, φ(x_{t−1})].\nCheap MO solver. We sample f̃i from the GP model Mi for each of the K functions as described above. A cheap multi-objective optimization problem over the K sampled functions f̃1, f̃2,···, f̃K is solved to compute the sample Pareto front Y*_s. This cheap multi-objective optimization also allows us to capture the interactions between the different objectives. We employ the popular NSGA-II algorithm [3] to solve the MO problem with cheap objective functions, noting that any other algorithm can be used to similar effect.\n\n2) Entropy computation with a sample Pareto front. Let Y*_s = {z1,···, z_m} be the sample Pareto front, where m is the size of the Pareto front and each zi = {zi^1,···, zi^K} is a K-vector evaluated at the K sampled functions.
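The posterior sampling step above can be sketched as follows. This is a simplified single-function illustration of the θ | D ∼ N(A⁻¹Φᵀy, σ²A⁻¹) construction; the feature count, the fixed seed, and the names are illustrative assumptions.

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier feature map phi(x) approximating an RBF kernel.
    W and b are frozen random frequencies and phases
    (W ~ N(0, I / length_scale^2), b ~ Uniform[0, 2*pi])."""
    M = W.shape[0]
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)

def sample_posterior_function(X, y, W, b, noise=0.1, rng=None):
    """Draw one posterior function sample f~(x) = phi(x)^T theta with
    theta | D ~ N(A^{-1} Phi^T y, sigma^2 A^{-1}),
    A = Phi^T Phi + sigma^2 I."""
    rng = rng or np.random.default_rng(0)
    Phi = rff_features(X, W, b)                       # shape (n, M)
    A = Phi.T @ Phi + noise ** 2 * np.eye(W.shape[0])
    A_inv = np.linalg.inv(A)
    theta = rng.multivariate_normal(A_inv @ Phi.T @ y, noise ** 2 * A_inv)
    # The returned closure is a deterministic cheap-to-evaluate function.
    return lambda Xq: rff_features(Xq, W, b) @ theta
```

Drawing one such closure per objective yields K cheap deterministic functions, over which a solver such as NSGA-II can compute a sample Pareto front Y*_s.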
The following inequality holds for each component y^j of the K-vector y = {y^1,···, y^K} in the entropy term H(y | D, x, Y*_s):\n\ny^j ≤ max{z1^j,···, z_m^j}  ∀j ∈ {1,···, K}   (4.9)\n\nThe inequality essentially says that the jth component of y (i.e., y^j) is upper-bounded by the maximum of the jth components of all m K-vectors in the Pareto front Y*_s. This inequality can be proven by a contradiction argument. Suppose there exists some component y^j of y such that y^j > max{z1^j,···, z_m^j}. Then, by definition, y is a non-dominated point, because no point dominates it in the jth dimension. This results in y ∈ Y*_s, which is a contradiction. Therefore, our hypothesis that y^j > max{z1^j,···, z_m^j} is incorrect, and inequality 4.9 holds.\nBy combining inequality 4.9 with the fact that each function is modeled as a GP, we can model each component y^j as a truncated Gaussian distribution, since the distribution of y^j needs to satisfy y^j ≤ max{z1^j,···, z_m^j}.
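The differential entropy of such an upper-truncated Gaussian has a closed form. A minimal sketch, with γ = (upper − µ)/σ and illustrative helper names:

```python
import math

def std_pdf(z):
    """Standard normal p.d.f."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_cdf(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_gaussian_entropy(mu, sigma, upper):
    """Differential entropy of N(mu, sigma^2) truncated to y <= upper,
    with gamma = (upper - mu) / sigma:
    (1 + ln(2*pi))/2 + ln(sigma) + ln(Phi(gamma))
    - gamma * phi(gamma) / (2 * Phi(gamma))."""
    gamma = (upper - mu) / sigma
    Phi = std_cdf(gamma)
    return (0.5 * (1.0 + math.log(2.0 * math.pi)) + math.log(sigma)
            + math.log(Phi) - gamma * std_pdf(gamma) / (2.0 * Phi))
```

As the truncation point moves far above the mean the expression recovers the untruncated Gaussian entropy, and tightening the truncation strictly reduces the entropy, which is what makes the Pareto-front bound informative.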
Furthermore, a common property of the entropy measure allows us to decompose the entropy of a set of independent variables into a sum over the entropies of the individual variables [2]:\n\nH(y | D, x, Y*_s) ≈ Σ_{j=1}^{K} H(y^j | D, x, max{z1^j,···, z_m^j})   (4.10)\n\nEquation 4.10 and the fact that the entropy of a truncated Gaussian distribution [14] can be computed in closed form give the following mathematical expression for the entropy term H(y | D, x, Y*_s). We provide the complete details of the derivation in the Appendix.\n\nH(y | D, x, Y*_s) ≈ Σ_{j=1}^{K} [ (1 + ln(2π))/2 + ln(σj(x)) + ln Φ(γ_s^j(x)) − γ_s^j(x) φ(γ_s^j(x)) / (2 Φ(γ_s^j(x))) ]   (4.11)\n\nwhere γ_s^j(x) = (y_s^{j*} − µj(x)) / σj(x), y_s^{j*} = max{z1^j,···, z_m^j}, and φ and Φ are the p.d.f. and c.d.f. of a standard normal distribution, respectively. By combining equations 4.7 and 4.11 with equation 4.6, we get the final form of our acquisition function as shown below:\n\nα(x) ≈ (1/S) Σ_{s=1}^{S} Σ_{j=1}^{K} [ γ_s^j(x) φ(γ_s^j(x)) / (2 Φ(γ_s^j(x))) − ln Φ(γ_s^j(x)) ]   (4.12)\n\nA complete description of the MESMO algorithm is given in Algorithm 1. The blue colored steps correspond to the computation of our output space entropy based acquisition function via sampling.\n\nAlgorithm 1 MESMO Algorithm\nInput: input space X; K blackbox objective functions f1(x), f2(x),···, fK(x); and maximum number of iterations Tmax\n1: Initialize Gaussian process models M1, M2,···, MK by evaluating at N0 initial points\n2: for each iteration t = N0 + 1 to Tmax do\n3: Select xt ← arg max_{x∈X} αt(x), where αt(.) is computed as:\n4: for each sample s ∈ 1,···, S:\n5: Sample f̃i ∼ Mi, ∀i ∈ {1,···, K}\n6: Y*_s ← Pareto front of cheap multi-objective optimization over (f̃1,···, f̃K)\n7: Compute αt(.) based on the S samples of Y*_s as given in Equation 4.12\n8: Evaluate xt: yt ← (f1(xt),···, fK(xt))\n9: Aggregate data: D ← D ∪ {(xt, yt)}\n10: Update models M1, M2,···, MK\n11: t ← t + 1\n12: end for\n13: return Pareto front of f1(x), f2(x),···, fK(x) based on D\n\n4.1 Theoretical Analysis\n\nIn this section, we provide a theoretical analysis of the behavior of the MESMO algorithm. The multi-objective optimization literature has multiple metrics to assess the quality of a Pareto front approximation. The two commonly employed metrics are the Pareto hypervolume indicator [29] and the R2 indicator [18]. The R2 indicator is a natural extension of the cumulative regret measure in single-objective BO, as proposed in the well-known work by Srinivas et al. [25] to prove convergence results. Prior work [17] has shown that the R2 and Pareto hypervolume indicators exhibit similar behavior. Indeed, our experiments validate this claim for MESMO. Therefore, we present the theoretical analysis of MESMO with respect to the R2 indicator. Let x* be a point in the optimal Pareto set X*. Let xt be the point selected for evaluation by MESMO at the tth iteration. Let R(x*) = ‖R1,···, RK‖, where Rj = Σ_{t=1}^{T′} (fj(x*) − fj(xt)) and ‖.‖ is the norm of the K-vector. We discuss asymptotic bounds for this measure over the input set X.\n\nTheorem 1. Let P be a distribution over the vector [y^{1*},···, y^{K*}], where each y^{j*} is the maximum value of function fj among the vectors in the Pareto front obtained by solving the cheap multi-objective optimization problem over the sampled functions from the K Gaussian process models. Let the observation noise for function evaluations be i.i.d. N(0, σ) and w = Pr[(y^{1*} > f1(x*)),···, (y^{K*} > fK(x*))]. If xt is the candidate input selected by MESMO at the tth iteration according to 4.12 and [y^{1*},···, y^{K*}] is drawn from P, then with probability at least 1 − δ, in T′ = Σ_{i=1}^{T} log_w(δ / (2πi)) number of iterations,\n\nR(x*) = sqrt( Σ_{j=1}^{K} (v_{t*}^j + ζ_T)² · (2 T γ_T^j / log(1 + σ⁻²)) )   (4.13)\n\nwhere ζ_T = (2 log(π_T / δ))^{1/2}, πi > 0, Σ_{i=1}^{T} (1/πi) ≤ 1, v_{t*}^j = max_t v_t^j with v_t^j = min_{x∈X} (y^{j*} − µ_{j,t−1}(x)) / σ_{j,t−1}(x), and γ_T^j is the maximum information gain about function fj after T function evaluations.\n\nWe provide details of the proof in the Appendix. The key message of this result is that, since each term Rj in R(x*) grows sub-linearly in the asymptotic sense, R(x*), which is defined as the norm, also grows sub-linearly.\n\n5 Experiments and Results\n\nIn this section, we describe our experimental setup, present results of MESMO on diverse synthetic and real-world benchmarks, and compare MESMO with existing methods.\n\n5.1 Experimental Setup\n\nMulti-objective BO algorithms.
We compare MESMO with the existing methods described in the related work: ParEGO [11], PESMO [7], SMSego [19], EHI [5], and SUR [17]. We employ the code for these methods from the BO library Spearmint1. For methods requiring PHV computation, we employ the PyGMO library2. According to the PyGMO documentation, the algorithm from [15] is employed for PHV computation. We did not include PAL [31], as it is known to have performance similar to SMSego [7] and works only for finite discrete input spaces.\nStatistical models. We use a GP based statistical model with a squared exponential (SE) kernel in all our experiments. The hyper-parameters are estimated after every 5 function evaluations. We initialize the GP models for all functions by sampling initial points at random from a Sobol grid. This initialization procedure is the same as the one built into the Spearmint library.\nSynthetic benchmarks. We construct two synthetic multi-objective benchmark problems using a combination of commonly employed benchmark functions for single-objective optimization3. We also employ two benchmarks from the general multi-objective optimization literature [16, 4]. We provide the complete details of these MO benchmarks below.\n\n1) BC-2,2: We evaluate two benchmark functions, Branin and Currin. The dimension of the input space d is 2.\n\n2) PRDZPS-6,6: We evaluate six benchmark functions, namely, Powell, Rastrigin, Dixon, Zakharov, Perm, and SumSquares. The dimension of the input space d is 6.\n\n1https://github.com/HIPS/Spearmint/tree/PESM\n2https://esa.github.io/pygmo/\n3https://www.sfu.ca/~ssurjano/optimization.html\n\nFigure 1: Results of different multi-objective BO algorithms, including MESMO, on synthetic benchmarks. The log of the hypervolume difference and the R2 indicator are shown for different numbers of function evaluations. The mean and variance of 10 different runs are plotted. The title of each figure refers to the name of the benchmark.
(Figures better seen in color.)\n\n3) OKA2-2,3: We evaluate the two functions defined in [16]. The dimension of the input space d is 3.\n\n4) DTLZ1-4,5: We evaluate the four functions defined in [4]. The dimension of the input space d is 5.\n\nReal-world benchmarks. We employed four real-world benchmarks with data available at [31, 21].\n\n1) Hyper-parameter tuning of neural networks. In this benchmark, our goal is to find a neural network with high accuracy and low prediction time. We optimize a dense neural network over the MNIST dataset [13]. The hyper-parameters include the number of hidden layers, the number of neurons per layer, the dropout probability, the learning rate, and the regularization weight penalties l1 and l2. We employ 10K instances for validation and 50K instances for training. We train the network for 100 epochs to evaluate each candidate hyper-parameter setting on the validation set. We apply a logarithm function to the error rates due to their very small values.\n\n2) SW-LLVM compiler settings optimization. SW-LLVM is a data set with 1024 compiler settings [23] determined by d = 10 binary inputs. The goal of this experiment is to find a setting of the LLVM compiler that optimizes the memory footprint and performance on a given set of software programs. Evaluating these objectives is very costly, and testing all the compiler settings takes days.\n\n3) SNW sorting network optimization. The data set SNW was first introduced by [30]. The goal is to optimize the area and throughput for the synthesis of a field-programmable gate array (FPGA) platform. The input space consists of 206 different hardware design implementations of a sorting network. Each design is defined by d = 4 input variables.\n\n4) Network-on-chip (NoC) optimization. The design space of the NoC dataset [1] consists of 259 implementations of a tree-based network-on-chip. Each configuration is defined by d = 4 variables: width, complexity, FIFO, and multiplier.
We optimize the energy and runtime of application-specific integrated circuits (ASICs) on the CoreMark benchmark workload [1].\nEvaluation metrics. We employ two common metrics used in practice.\n\n1) The Pareto hypervolume (PHV) is commonly employed to measure the quality of a given Pareto front [29]. PHV is defined as the volume between a reference point and the given Pareto front. After each iteration t, we report the difference between the hypervolume of the ideal Pareto front (Y*) and the hypervolume of the Pareto front (Y_t) estimated by a given algorithm:\n\nPHV_diff = PHV(Y*) − PHV(Y_t)   (5.1)\n\n2) The R2 indicator is the average distance between the ideal Pareto front (Y*) and the Pareto front (Y_t) estimated by a given algorithm [18]. R2 is a distance based metric that degenerates to the regret metric presented in the theoretical analysis.\n\n5.2 Results and Discussion\n\nWe run all experiments 10 times. The mean and variance of the PHV and R2 metrics across the different runs are reported as a function of the number of iterations.\n\nFigure 2: Results of different multi-objective BO algorithms, including MESMO, on real-world benchmarks. The log of the hypervolume difference and the R2 indicator are shown for different numbers of function evaluations. The mean and variance of 10 different runs are plotted. The title of each figure refers to the name of the real-world benchmark. (Figures better seen in color.)\n\nMESMO vs. State-of-the-art. We evaluate the performance of MESMO and PESMO with different numbers of Monte-Carlo samples for acquisition function optimization. Figure 1 and Figure 2 show the results of all multi-objective BO algorithms, including MESMO, for the synthetic and real-world benchmarks respectively. We present additional results on synthetic benchmarks in Figure 3 of the Appendix. We make the following empirical observations: 1) MESMO consistently performs better than all baselines and also converges much faster.
For blackbox optimization problems with expensive function evaluations, faster convergence has practical benefits, as it allows the end-user or decision-maker to stop early. 2) The rate of convergence of MESMO varies slightly with different numbers of Monte-Carlo samples. However, in all cases, MESMO performs better than the baseline methods. 3) The convergence rate of PESMO is dramatically affected by the number of Monte-Carlo samples: 100 samples lead to better results than 10 and 1. In contrast, MESMO maintains better performance consistently, even with a single sample! These results strongly demonstrate that MESMO is much more robust to the number of Monte-Carlo samples than PESMO. 4) The performance of ParEGO is very inconsistent. In some cases, it is comparable to MESMO, but it performs poorly in many other cases. This is expected due to random scalarization.\nComparison of acquisition function optimization time. We compare the runtime of acquisition function optimization for the different multi-objective BO algorithms, including MESMO and PESMO (with different numbers of Monte-Carlo samples). We do not account for the time to fit the GP models, since it is the same for all the algorithms. We measure the average acquisition function optimization time across all iterations. We run all experiments on a machine with the following configuration: Intel i7-7700K CPU @ 4.20GHz with 8 cores and 32 GB memory. Table 1 shows the time in seconds for the two synthetic benchmarks. We present additional time comparison results in Figure 4 of the Appendix. We fix the input space dimension to d = 5 and vary the number of objective functions to show how the different algorithms scale with an increasing number of objectives. We make the following observations: 1) The acquisition function optimization time of MESMO is significantly smaller than that of PESMO for the same number of samples. The difference between the corresponding times grows significantly as the number of samples increases.
2) MESMO with one sample is comparable to ParEGO, which relies on scalarization to reduce acquisition function optimization to the single-objective BO setting. 3) The time for PESMO and SMSego increases significantly as the number of objectives grows from two to six, whereas the corresponding growth in time is relatively small for MESMO.

Table 1: Average acquisition function optimization time in seconds.

MO Algorithm   BC-2,2        PRDZPS-6,6      MO Algorithm   BC-2,2        PRDZPS-6,6
MESMO-1        3.5±0.34      13.6±3.2        PESMO-1        4.56±0.71     110.4±17.8
MESMO-10       24.4±5.75     115.23±17.1     PESMO-10       38.65±0.65    614.27±44
MESMO-100      242.434±8.9   1128.3±15.3     PESMO-100      377.53±4.29   6092.96±53.1
ParEGO         3.2±1.6       5.3±2.3         SMSego         80.5±2.1      300.43±35.7

6 Summary and Future Work

We introduced a novel and principled approach referred to as MESMO to solve multi-objective Bayesian optimization problems. The key idea is to employ an output-space entropy based acquisition function to efficiently select inputs for evaluation. Our comprehensive experimental results on both synthetic and real-world benchmarks showed that MESMO yields consistently better results than state-of-the-art methods, and is more efficient and robust than methods based on input-space entropy search. Future work includes applying MESMO to solve novel engineering and scientific applications.
Acknowledgements. The authors gratefully acknowledge the support from National Science Foundation (NSF) grants IIS-1845922 and OAC-1910213. The views expressed are those of the authors and do not reflect the official policy or position of the NSF.

References
[1] Oscar Almer, Nigel Topham, and Björn Franke. A learning-based approach to the automated design of MPSoC networks. In International Conference on Architecture of Computing Systems, pages 243–258.
Springer, 2011.

[2] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley and Sons, 2012.

[3] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.

[4] Kalyanmoy Deb, Lothar Thiele, Marco Laumanns, and Eckart Zitzler. Scalable test problems for evolutionary multiobjective optimization. In Evolutionary Multiobjective Optimization, pages 105–145. Springer, 2005.

[5] Michael Emmerich and Jan-willem Klinkenberg. The computation of the expected improvement in dominated hypervolume of Pareto front approximations. Technical Report, Leiden University, 34, 2008.

[6] Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research (JMLR), 13(Jun):1809–1837, 2012.

[7] Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Amar Shah, and Ryan Adams. Predictive entropy search for multi-objective Bayesian optimization. In Proceedings of International Conference on Machine Learning (ICML), pages 1492–1501, 2016.

[8] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.

[9] Matthew W Hoffman and Zoubin Ghahramani. Output-space predictive entropy search for flexible global optimization. In NIPS Workshop on Bayesian Optimization, 2015.

[10] Donald R Jones, Cary D Perttunen, and Bruce E Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, 1993.

[11] Joshua Knowles. ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems.
IEEE Transactions on Evolutionary Computation, 10(1):50–66, 2006.

[12] Lars Kotthoff, Chris Thornton, Holger H Hoos, Frank Hutter, and Kevin Leyton-Brown. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research (JMLR), 18(1):826–830, 2017.

[13] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[14] Joseph Victor Michalowicz, Jonathan M Nichols, and Frank Bucholtz. Handbook of Differential Entropy. Chapman and Hall/CRC, 2013.

[15] Krzysztof Nowak, Marcus Märtens, and Dario Izzo. Empirical performance of the approximation of the least hypervolume contributor. In International Conference on Parallel Problem Solving from Nature, pages 662–671. Springer, 2014.

[16] Tatsuya Okabe, Yaochu Jin, Markus Olhofer, and Bernhard Sendhoff. On test functions for evolutionary multi-objective optimization. In International Conference on Parallel Problem Solving from Nature, pages 792–802. Springer, 2004.

[17] Victor Picheny. Multi-objective optimization using Gaussian process emulators via stepwise uncertainty reduction. Statistics and Computing, 25(6):1265–1280, 2015.

[18] Victor Picheny, Tobias Wagner, and David Ginsbourger. A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48(3):607–626, 2013.

[19] Wolfgang Ponweiser, Tobias Wagner, Dirk Biermann, and Markus Vincze. Multiobjective optimization on a limited budget of evaluations using model-assisted S-metric selection. In International Conference on Parallel Problem Solving from Nature, pages 784–794. Springer, 2008.

[20] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines.
In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[21] Amar Shah and Zoubin Ghahramani. Pareto frontier learning with expensive correlated objectives. In Proceedings of International Conference on Machine Learning (ICML), pages 1919–1927, 2016.

[22] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

[23] Norbert Siegmund, Sergiy S Kolesnikov, Christian Kästner, Sven Apel, Don Batory, Marko Rosenmüller, and Gunter Saake. Predicting performance via automated feature-interaction detection. In Proceedings of the 34th International Conference on Software Engineering (ICSE), pages 167–177, 2012.

[24] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[25] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

[26] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proceedings of International Conference on Machine Learning (ICML), 2017.

[27] Zi Wang, Bolei Zhou, and Stefanie Jegelka. Optimization as estimation with Gaussian processes in bandit settings. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1022–1031, 2016.

[28] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, 2006.

[29] Eckart Zitzler. Evolutionary algorithms for multiobjective optimization: Methods and applications, volume 63.
Ithaca: Shaker, 1999.

[30] Marcela Zuluaga, Peter Milder, and Markus Püschel. Computer generation of streaming sorting networks. In Proceedings of Design Automation Conference (DAC), pages 1241–1249, 2012.

[31] Marcela Zuluaga, Guillaume Sergent, Andreas Krause, and Markus Püschel. Active learning for multi-objective optimization. In Proceedings of International Conference on Machine Learning (ICML), pages 462–470, 2013.