{"title": "Multi-objective Bayesian optimisation with preferences over objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 12235, "page_last": 12245, "abstract": "We present a multi-objective Bayesian optimisation algorithm that allows the user to express preference-order constraints on the objectives of the type objective A is more important than objective B. These preferences are defined based on the stability of the obtained solutions with respect to preferred objective functions. Rather than attempting to find a representative subset of the complete Pareto front, our algorithm selects those Pareto-optimal points that satisfy these constraints. We formulate a new acquisition function based on expected improvement in dominated hypervolume (EHI) to ensure that the subset of Pareto front satisfying the constraints is thoroughly explored. The hypervolume calculation is weighted by the probability of a point satisfying the constraints from a gradient Gaussian Process model. We demonstrate our algorithm on both synthetic and real-world problems.", "full_text": "Multi-objective Bayesian optimisation with\n\npreferences over objectives\n\nMajid Abdolshah, Alistair Shilton, Santu Rana, Sunil Gupta, Svetha Venkatesh\n\nThe Applied Arti\ufb01cial Intelligence Institute (A2I2),\n\nDeakin University, Australia\n\n{majid,alistair.shilton,santu.rana,sunil.gupta,svetha.venkatesh}\n\n@deakin.edu.au\n\nAbstract\n\nWe present a multi-objective Bayesian optimisation algorithm that allows the user\nto express preference-order constraints on the objectives of the type \u201cobjective A\nis more important than objective B\u201d. These preferences are de\ufb01ned based on the\nstability of the obtained solutions with respect to preferred objective functions.\nRather than attempting to \ufb01nd a representative subset of the complete Pareto front,\nour algorithm selects those Pareto-optimal points that satisfy these constraints. 
We formulate a new acquisition function based on expected improvement in dominated hypervolume (EHI) to ensure that the subset of the Pareto front satisfying the constraints is thoroughly explored. The hypervolume calculation is weighted by the probability of a point satisfying the constraints from a gradient Gaussian Process model. We demonstrate our algorithm on both synthetic and real-world problems.\n\n1 Introduction\n\nIn many real-world problems, practitioners must sequentially evaluate a noisy, black-box, expensive-to-evaluate function f with the goal of finding its optimum in some domain X. Bayesian optimisation is a well-known algorithm for such problems. Applications include hyperparameter tuning [27, 13, 12], expensive multi-objective optimisation for robotics [2, 1], and experimentation optimisation in product design, such as short polymer fiber materials [16].\nMulti-objective Bayesian optimisation involves at least two conflicting, black-box, expensive-to-evaluate objectives to be optimised simultaneously. Multi-objective optimisation usually assumes that all objectives are equally important, and solutions are found by seeking the Pareto front in the objective space [4, 5, 3]. However, in most cases users can stipulate preferences over objectives. This information determines the relative importance of different sections of the Pareto front. Using it to preferentially sample the Pareto front will therefore boost the efficiency of the optimiser, which is particularly advantageous when the objective functions are expensive.\nIn this study, preferences over objectives are stipulated based on the stability of the solutions with respect to a set of objective functions. 
As an example, investment strategists may look for Pareto-optimal investment strategies that prefer stable solutions for return (objective 1) but more diverse solutions with respect to risk (objective 2), as they can later decide their appetite for risk. As can be inferred, stability in one objective produces more diverse solutions for the other objectives. We believe that in many real-world problems our proposed method can help reduce the cost, and improve the safety, of experimental design.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: (a) Local Pareto optimality for a 2-objective example with a 1D design space. Local optimality implies that ∂f0(x)/∂x and ∂f1(x)/∂x have opposite signs, since the weighted sum of gradients of the objectives with respect to x must be zero: s^T ∂f(x)/∂x = 0. In (b) we additionally require that ||∂f1(x)/∂x|| > ||∂f0(x)/∂x||, so perturbation of x will cause relatively more change in f1 than f0, i.e. such solutions are (relatively) stable in objective f0. (c) shows the converse, namely ||∂f0(x)/∂x|| > ||∂f1(x)/∂x||, favoring solutions that are (relatively) stable in objective f1 and diverse in f0.\n\nWhilst multi-objective Bayesian optimisation for sample-efficient discovery of the Pareto front is an established research track [9, 18, 8, 15], limited work has examined the incorporation of preferences. In a recent study [18], given a user-specified preferred region in objective space, the optimiser focuses its sampling to derive the Pareto front efficiently. However, such preferences assume accurate prior knowledge about the objective space and the preferred region (generally a hyperbox) for Pareto front solutions. 
The main contribution of this study is to formulate the concept of preference-order constraints and incorporate it into a multi-objective Bayesian optimisation framework, addressing the unavailability of prior knowledge and boosting the performance of optimisation in such scenarios.\nWe formulate the preference-order constraints through an ordering of derivatives and incorporate them into multi-objective optimisation using the geometry of the constraint space, whilst needing no prior information about the functions. Formally, we find a representative set of Pareto-optimal solutions to the following multi-objective optimisation problem:\n\nD* ⊂ X* = argmax_{x ∈ X} f(x)    (1)\n\nsubject to preference-order constraints, that is, assuming f = [f0, f1, . . . , fm], f0 is more important (in terms of stability) than f1, and so on. Our algorithm aims to maximise the dominated hypervolume of the solution in a way that gives more weight to the solutions that meet the constraints.\nTo formalise the concept of preference-order constraints, we first note that a point is locally Pareto optimal if any sufficiently small perturbation of a single design parameter of that point does not simultaneously increase (or decrease) all objectives. Thus, equivalently, a point is locally Pareto optimal if we can define a set of weight vectors such that, for each design parameter, the weighted sum of gradients of the objectives with respect to that design parameter is zero (see Figure 1a). Therefore, the weight vectors define the relative importance of each objective at that point. Figure 1b illustrates this concept, where the blue box defines the region of stability for the function f0. 
Since in this section the magnitude of the partial derivative of f0 is smaller than that of f1, satisfying Pareto optimality requires a higher weight on the gradient of f0 than on that of f1 (see Figure 1b). Conversely, in Figure 1c, the red box highlights the section of the Pareto front where solutions have high stability in f1. To obtain samples from this section of the Pareto front, the weight on the gradient of f0 must be smaller than that on the gradient of f1.\nOur solution is based on understanding the geometry of the constraints in the weight space. We show that preference-order constraints give rise to a polyhedral proper cone in this space. We show that the Pareto-optimality condition requires the gradients of the objectives at Pareto-optimal points to lie in a cone perpendicular to that polyhedral cone. We then quantify the posterior probability that any point satisfies the preference-order constraints given a set of observations. We show how these posterior probabilities may be incorporated into the EHI acquisition function [11] to steer the Bayesian optimiser toward Pareto-optimal points that satisfy the preference-order constraint and away from those that do not.\n\n2 Notation\nSets are written A, B, C, . . . where R+ is the positive reals, R̄+ = R+ ∪ {0}, Z+ = {1, 2, . . .}, and Zn = {0, 1, . . . , n − 1}. |A| is the cardinality of the set A. 
Tuples (ordered sets) are denoted A, B, C, . . .. Distributions are denoted A, B, C, . . .. Column vectors are bold lower case a, b, c, . . .. Matrices are bold upper case A, B, C, . . .. Element i of vector a is ai, and element i, j of matrix A is Ai,j (all indexed i, j = 0, 1, . . .). The transpose is denoted a^T, A^T. I is the identity matrix, 1 is a vector of 1s, 0 is a vector of 0s, and ei is a vector with e(i)j = δij, where δij is the Kronecker delta. ∇x = [∂/∂x_0, ∂/∂x_1, . . . , ∂/∂x_{n−1}]^T, sgn(x) is the sign of x (where sgn(0) = 0), and the indicator function is denoted as 1(A).\n\n3 Background\n\n3.1 Gaussian Processes\nLet X ⊂ R^n be compact. A Gaussian process [23] GP(µ, K) is a distribution on the function space f : X → R defined by mean µ : X → R (assumed zero without loss of generality) and kernel (covariance) K : X × X → R. If f(x) ∼ GP(0, K(x, x′)) then the posterior of f given D = {(x(j), y(j)) ∈ R^n × R | y(j) = f(x(j)) + ε, ε ∼ N(0, σ²), j ∈ ZN} is f(x)|D ∼ N(µ_D(x), σ_D(x, x′)), where:\n\nµ_D(x) = k^T(x) (K + σ²I)^(−1) y\nσ_D(x, x′) = K(x, x′) − k^T(x) (K + σ²I)^(−1) k(x′)    (2)\n\nand y, k(x) ∈ R^|D|, K ∈ R^(|D|×|D|), k(x)j = K(x, x(j)), Kjk = K(x(j), x(k)).\nSince differentiation is a linear operation, the derivative of a Gaussian process is also a Gaussian process [17, 22]. The posterior of ∇x f given D is ∇x f(x)|D ∼ N(µ′_D(x), σ′_D(x, x′)), where:\n\nµ′_D(x) = (∇x k^T(x)) (K + σ²I)^(−1) y\nσ′_D(x, x′) = ∇x ∇x′^T K(x, x′) − (∇x k^T(x)) (K + σ²I)^(−1) (∇x′ k^T(x′))^T    (3)\n\n3.2 Multi-Objective Optimisation\n\nA multi-objective optimisation problem has the form:\n\nargmax_{x ∈ X} f(x)    (4)\n\nwhere the components of f : X ⊂ R^n → Y ⊂ R^m represent the m distinct objectives fi : X → R. X and Y are called design space and objective space, respectively. A Pareto-optimal solution is a point x* ∈ X for which it is not possible to find another solution x ∈ X such that fi(x) > fi(x*) for all objectives f0, f1, . . . , fm−1. The set of all Pareto-optimal solutions is the Pareto set X* = {x* ∈ X | ∄x ∈ X : f(x) ≻ f(x*)}, where y ≻ y′ (y dominates y′) means y ≠ y′, yi ≥ y′i ∀i, and y ⪰ y′ means y ≻ y′ or y = y′.\nGiven observations D = {(x(j), y(j)) ∈ R^n × R^m | y(j) = f(x(j)) + ε, εi ∼ N(0, σi²)} of f, the dominant set D* = {(x*, y*) ∈ D | ∄(x, y) ∈ D : y ⪰ y*} is the most optimal subset of D (in the Pareto sense). The “goodness” of D is often measured by the dominated hypervolume (S-metric, [31, 10]) with respect to some reference point z ∈ R^m:\n\nS(D) = S(D*) = ∫_{y ≥ z} 1(∃y(i) ∈ D | y(i) ⪰ y) dy\n\nThus our aim is to find the set D that maximises the hypervolume. 
Optimised algorithms exist for calculating the hypervolume S(D) = S(D*) [29, 25], which is typically calculated by sorting the dominant observations along each axis in objective space to form a grid. The dominated hypervolume (with respect to z) is then the sum of the hypervolumes of the dominated cells (c_k), i.e. S(D) = Σ_k vol(c_k).\n\n3.3 Bayesian Multi-Objective Optimisation\n\nIn the multi-objective case one typically assumes that the components of f are draws from independent Gaussian processes, i.e. fi(x) ∼ GP(0, K(i)(x, x′)), and fi and fi′ are independent ∀i ≠ i′. A popular acquisition function for multi-objective Bayesian optimisation is expected hypervolume improvement (EHI). The EHI acquisition function [26, 30] is defined by:\n\na_t(x|D) = E_{f(x)|D} [S(D ∪ {(x, f(x))}) − S(D)]    (5)\n\nand represents the expected change in the dominated hypervolume by the set of observations based on the posterior Gaussian process.\n\n4 Problem Formulation\nLet f : X ⊂ R^n → Y ⊂ R^m be a vector of m independent draws fi ∼ GP(0, K(i)(x, x)) from zero-mean Gaussian processes. Assume that f is expensive to evaluate. Our aim is to find a representative set of Pareto-optimal solutions to the following multi-objective optimisation problem:\n\nD* ⊂ X* = argmax_{x ∈ X_I ⊂ X} f(x)    (6)\n\nsubject to preference-order constraints. Specifically, we want to explore only that subset of solutions X_I ⊂ X that place more importance on one objective f_i0 than objective f_i1, and so on, as specified by the (ordered) preference tuple I = (i0, i1, . . . iQ | {i0, i1, . . .} ⊂ Zm, ik ≠ ik′ ∀k ≠ k′), where Q ∈ Zm is the number of defined preferences over objectives.\n\n4.1 Preference-Order Constraints\nLet x* ∈ int(X) ∩ X* be a Pareto-optimal point in the interior of X. Necessary (but not sufficient, local) Pareto optimality conditions require that, for all sufficiently small δx ∈ R^n, f(x* + δx) ⊁ f(x*), or, equivalently, (δx^T ∇x) f(x*) ∉ R^m_+. A necessary (again not sufficient) equivalent condition is that, for each axis j ∈ Zn in design space, sufficiently small changes in xj do not cause all objectives to simultaneously increase (and/or remain unchanged) or decrease (and/or remain unchanged). Failure of this condition would indicate that simply changing design parameter xj could improve all objectives, and hence that x* was not in fact Pareto optimal. In summary, local Pareto optimality requires that ∀j ∈ Zn there exists s(j) ∈ R̄^m_+\\{0} such that:\n\ns(j)^T ∂f(x)/∂xj = 0    (7)\n\nIt is important to note that this is not the same as the optimality conditions that may be derived from linear scalarisation, as the optimality conditions that arise from linear scalarisation additionally require that s(0) = s(1) = . . . = s(n−1). Moreover (7) applies to all Pareto-optimal points, whereas linear scalarisation optimisation conditions fail for Pareto points on non-convex regions [28].\nDefinition 1 (Preference-Order Constraints) Let I = (i0, i1, . . . iQ | {i0, i1, . . .} ⊂ Zm, ik ≠ ik′ ∀k ≠ k′) be an (ordered) preference tuple. A vector x ∈ X satisfies the associated preference-order constraint if ∃s(0), s(1), . . . , s(n−1) ∈ S_I such that:\n\ns(j)^T ∂f(x)/∂xj = 0 ∀j ∈ Zn\n\nwhere S_I ≜ {s ∈ R̄^m_+\\{0} | s_i0 ≥ s_i1 ≥ s_i2 ≥ . . .}. Further we define X_I to be the set of all x ∈ X satisfying the preference-order constraint. Equivalently:\n\nX_I = {x ∈ X | ∂f(x)/∂xj ∈ S_I^⊥ ∀j ∈ Zn}\n\nwhere S_I^⊥ ≜ {x ∈ X | ∃s ∈ S_I, s^T x = 0}.\n\nNote that (7) and Definition 1 are the key to calculating the compliance of a recommended solution with the preference-order constraints. Having defined preference-order constraints, we then calculate the posterior probability that x ∈ X_I, and show how these posterior probabilities may be incorporated into the EHI acquisition function to steer the Bayesian optimiser toward Pareto-optimal points that satisfy the preference-order constraint. Before proceeding, however, it is necessary to briefly consider the geometry of S_I and S_I^⊥.\n\nFigure 2: Illustration of S_I^⊥, S_I and the vectors a(0), a(1) for a 2D case where I = (0, 1), so s0 > s1. S_I is a proper cone representing the preference-order constraints; S_I^⊥ is the union of two sub-spaces. v ∈ S_I^⊥ implies a solution complying with preference-order constraints. b0 and b1 are the projection of v over ã(0) and ã(1). In order to satisfy v ∈ S_I^⊥, it is necessary that ∃s ∈ S_I s.t. v^T s = 0 or, equivalently, v = 0 or b0 = ã(0)^T v and b1 = ã(1)^T v have different signs.\n\n4.2 The geometry of S_I and S_I^⊥\n\nIn the following we assume, w.l.o.g., that the preference-order constraints follow the order of indices in objective functions (reorder, otherwise), and that there is at least one constraint.\nWe now define the preference-order constraints by the assumption I = (0, 1, . . . , Q | Q ∈ Zm\\{0}), where Q > 0. This defines the sets S_I and S_I^⊥, which in turn define the constraints that must be met by the gradients of f(x): either ∃s(0), s(1), . . . , s(n−1) ∈ S_I such that s(j)^T ∂f(x)/∂xj = 0 ∀j ∈ Zn or, equivalently, ∂f(x)/∂xj ∈ S_I^⊥ ∀j ∈ Zn. Next, Theorem 1 defines the representation of S_I.\n\nTheorem 1 Let I = (0, 1, . . . , Q | Q ∈ Zm\\{0}) be an (ordered) preference tuple. Define S_I as per definition 1. Then S_I is a polyhedral (finitely-generated) proper cone (excluding the origin) that may be represented using either a polyhedral representation:\n\nS_I = {s ∈ R^m | a(i)^T s ≥ 0 ∀i ∈ Zm}\\{0}    (8)\n\nor a generative representation:\n\nS_I = {Σ_{i ∈ Zm} ci ã(i) | c ∈ R̄^m_+}\\{0}    (9)\n\nwhere ∀i ∈ Zm:\n\na(i) = (1/√2) (e_i − e_{i+1}) if i ∈ Z_Q, and a(i) = e_i otherwise\nã(i) = (1/√(i+1)) Σ_{l ∈ Z_{i+1}} e_l if i ∈ Z_{Q+1}, and ã(i) = e_i otherwise\n\nand e_0, e_1, . . . , e_{m−1} are the Euclidean basis of R^m.\n\nProof of Theorem 1 is available in the supplementary material. To test if a point satisfies this requirement we need to understand the geometry of the set S_I. 
Theorem 1 shows that S_I ∪ {0} is a polyhedral (finitely generated) proper cone, represented either in terms of half-space constraints (polyhedral form) or as a positive span of extreme directions (generative representation). The geometrical intuition for this is given in Figure 2 for a simple, 2-objective case with a single preference-order constraint.\n\nAlgorithm 1 Test if v ∈ S_I^⊥.\nInput: Preference tuple I.\n  Test vector v ∈ R^m.\nOutput: 1(v ∈ S_I^⊥).\n// Calculate 1(v ∈ S_I^⊥).\nLet bj = ã(j)^T v ∀j ∈ Zm.\nif ∃i ≠ k ∈ Zm : sgn(bi) ≠ sgn(bk) return TRUE\nelseif b = 0 return TRUE\nelse return FALSE.\n\nAlgorithm 2 Preference-Order Constrained Bayesian Optimisation (MOBO-PC).\nInput: Preference-order tuple I.\n  Observations D = {(x(i), y(i)) ∈ X × Y}.\nfor t = 0, 1, . . . , T − 1 do\n  Select the test point: x = argmax_{x ∈ X} a_t^PEHI(x|D_t)\n  (a_t^PEHI is evaluated using algorithm 4).\n  Perform experiment y = f(x) + ε.\n  Update D_{t+1} := D_t ∪ {(x, y)}.\nend for\n\nAlgorithm 3 Calculate Pr(x ∈ X_I|D).\nInput: Observations D = {(x(i), y(i)) ∈ X × Y}.\n  Number of Monte Carlo samples R.\n  Test vector x ∈ X.\nOutput: Pr(x ∈ X_I|D).\nLet q = 0.\nfor k = 0, 1, . . . , R − 1 do\n  // Construct samples v(0), v(1), . . . , v(n−1) ∈ R^m.\n  Let v(j) = 0 ∀j ∈ Zn.\n  for i = 0, 1, . . . , m − 1 do\n    Sample u ∼ N(µ′_Di(x), σ′_Di(x, x)) (see (3)).\n    Let [v(0)i, v(1)i, . . . , v(n−1)i] := u^T.\n  end for\n  // Test if v(j) ∈ S_I^⊥ ∀j ∈ Zn.\n  Let q := q + Π_{j ∈ Zn} 1(v(j) ∈ S_I^⊥) (see algorithm 1).\nend for\nReturn q/R.\n\nAlgorithm 4 Calculate a_t^PEHI(x|D).\nInput: Observations D = {(x(i), y(i)) ∈ X × Y}.\n  Number of Monte Carlo samples R̃.\n  Test vector x ∈ X.\nOutput: a_t^PEHI(x|D).\nUsing algorithm 3, calculate:\n  s_x = Pr(x ∈ X_I|D)\n  s(j) = Pr(x(j) ∈ X_I|D) ∀(x(j), y(j)) ∈ D\nLet q = 0.\nfor k = 0, 1, . . . , R̃ − 1 do\n  Sample yi ∼ N(µ_Di(x), σ_Di(x)) ∀i ∈ Zm (see (2)).\n  Construct cells c_0, c_1, . . . from D ∪ {(x, y)} by sorting along each axis in objective space to form a grid.\n  Calculate: q = q + s_x Σ_{k : y ⪰ ỹ_ck} vol(c_k) Π_{j ∈ ZN : y(j) ⪰ ỹ_ck} (1 − s(j))\nend for\nReturn q/R̃.\n\nThe subsequent corollary allows us to construct a simple algorithm (algorithm 1) to test if a vector v lies in the set S_I^⊥. We will use this algorithm to test if ∂f(x)/∂xj ∈ S_I^⊥ ∀j ∈ Zn, that is, if x satisfies the preference-order constraints. The proof of corollary 1 is available in the supplementary material.\nCorollary 1 Let I = (0, 1, . . . , Q | Q ∈ Zm\\{0}) be an (ordered) preference tuple. Define S_I^⊥ as per definition 1. Using the notation of Theorem 1, v ∈ S_I^⊥ if and only if v = 0 or ∃i ≠ k ∈ Zm such that sgn(ã(i)^T v) ≠ sgn(ã(k)^T v), where sgn(0) = 0.\n\n5 Preference Constrained Bayesian Optimisation\n\nIn this section we do three things. First, we show how the Gaussian process models of the objectives fi (and their derivatives) may be used to calculate the posterior probability that x ∈ X_I defined by I = (0, 1, . . . , Q | Q ∈ Zm\\{0}). 
Second, we show how the EHI acquisition function may be modified and calculated to incorporate these probabilities and hence only reward points that satisfy the preference-order conditions. Finally, we give our algorithm using this acquisition function.\n\n5.1 Calculating Posterior Probabilities\nGiven that fi ∼ GP(0, K(i)(x, x)) are draws from independent Gaussian processes, and given observations D, we wish to calculate the posterior probability that x ∈ X_I, i.e.:\n\nPr(x ∈ X_I|D) = Pr(∂f(x)/∂xj ∈ S_I^⊥ ∀j ∈ Zn)\n\nAs fi ∼ GP(0, K(i)(x, x)) it follows that ∇x fi(x)|D ∼ N_i ≜ N(µ′_Di(x), σ′_Di(x, x′)), as defined by (3). Hence:\n\nPr(x ∈ X_I|D) = Pr(v(j) ∈ S_I^⊥ ∀j ∈ Zn | [v(0)i, v(1)i, . . . , v(n−1)i]^T ∼ N_i ∀i ∈ Zm)\n\nwhere v ∼ P(∇x f|D). We estimate it using Monte-Carlo [6] sampling as per algorithm 3.\n\n5.2 Preference-Order Constrained Bayesian Optimisation Algorithm (MOBO-PC)\n\nOur complete Bayesian optimisation algorithm with preference-order constraints is given in algorithm 2. The acquisition function introduced in this algorithm gives higher importance to points satisfying the preference-order constraints. Unlike standard EHI, we take expectation over both the expected experimental outcomes fi(x) ∼ N(µ_Di(x), σ_Di(x, x)), ∀i ∈ Zm, and the probability that points x(i) ∈ X_I and x ∈ X_I satisfy the preference-order constraints. 
We define our preference-based EHI acquisition function as:\n\na_t^PEHI(x|D) = E[S_I(D ∪ {(x, f(x))}) − S_I(D) | D]    (10)\n\nwhere S_I(D) is the hypervolume dominated by the observations (x, y) ∈ D satisfying the preference-order constraints. The calculation of S_I(D) is illustrated in the supplementary material. The expectation of S_I(D) given D is:\n\nE[S_I(D)|D] = Σ_k vol(c_k) Pr(∃(x, y) ∈ D | y ⪰ ỹ_ck ∧ x ∈ X_I)\n            = Σ_k vol(c_k) (1 − Π_{(x,y) ∈ D : y ⪰ ỹ_ck} (1 − Pr(x ∈ X_I|D)))\n\nwhere ỹ_ck is the dominant corner of cell c_k, vol(c_k) is the hypervolume of cell c_k, and the cells c_k are constructed by sorting D along each axis in objective space. The posterior probabilities Pr(x ∈ X_I|D) are calculated using algorithm 3. It follows that:\n\na_t^PEHI(x|D) = Pr(x ∈ X_I|D) E[Σ_{k : y ⪰ ỹ_ck} vol(c_k) Π_{j ∈ ZN : y(j) ⪰ ỹ_ck} (1 − Pr(x(j) ∈ X_I|D)) | yi ∼ N(µ_Di(x), σ_Di(x)) ∀i ∈ Zm]\n\nwhere the cells c_k are constructed using the set D ∪ {(x, y)} by sorting along each axis in objective space. We estimate this acquisition function using the Monte-Carlo simulation shown in algorithm 4.\n\n6 Experiments\n\nWe conduct a series of experiments to test the empirical performance of our proposed method MOBO-PC and compare it with other strategies. These experiments include synthetic data as well as optimising the hyperparameters of a feed-forward neural network. 
For the Gaussian processes, we use maximum likelihood estimation to set the hyperparameters [21].\n\n6.1 Baselines\n\nTo the best of our knowledge there are no studies aiming to solve our proposed problem; however, we use PESMO, SMSego, SUR, ParEGO and EHI [9, 20, 19, 14, 7] to confirm the validity of the obtained Pareto front solutions. The obtained Pareto front must lie on the ground-truth Pareto front whilst also satisfying the preference-order constraints. We compare our results with MOBO-RS [18] by suitably specifying bounding boxes in the objective space that can replicate a preference-order constraint.\n\n6.2 Synthetic Functions\n\nWe begin with a comparison on minimising the synthetic Schaffer function N. 1, with 2 conflicting objectives f0, f1 and a 1-dimensional input (see [24]). Figure 3a shows the ground-truth Pareto front for this function. To illustrate the behavior of our method, we impose distinct preferences. Three test cases are designed to illustrate the effects of imposing preference-order constraints on the objective functions for stability. Case (1): s0 ≈ s1, Case (2): s0 < s1 and Case (3): s0 > s1. For our method it is only required to define the preference-order constraints, whereas MOBO-RS additionally requires a bounding box. Figure 3b (case 1) shows the results of the preference-order constraints S_I ≜ {s ∈ R̄^m_+\\{0} | s0 ≈ s1} for our proposed method, where s0 represents the importance of stability in minimising f0 and s1 is the importance of stability in minimising f1. Because both objectives have the same importance, a balanced optimisation is expected. Higher weights are obtained for the Pareto front points in the middle region, with highest stability for both objectives. Figure 3c (case 2) is based on the preference order s0 < s1, which implies that stability in f1 is more important than in f0. The results show more stable Pareto points for f1 than f0. Figure 3d (case 3) shows the results of the s0 > s1 preference-order constraint. As expected, we see a larger number of stable Pareto points for the important objective (i.e. f0 in this case). We defined two bounding boxes for the MOBO-RS approach which can represent the preference-order constraints in our approach (Figure 3e and 3f). Infinitely many bounding boxes can serve as constraints on objectives in such problems; consequently, instability of the results is expected across the various definitions of bounding boxes. We believe our method can obtain more stable Pareto front solutions, especially when prior information is sparse. Having extra information in the form of the weight (importance) of the Pareto front points is another advantage.\n\n[Figure 3 panels: (a) Full Pareto front; (b) Case 1, s0 ≈ s1; (c) Case 2, s0 < s1; (d) Case 3, s0 > s1; (e); (f)]\n\nFigure 3: Finding Pareto fronts which comply with the preference-order constraint. Figure 3a shows the full Pareto front solution (with no preferences). Figure 3b illustrates the Pareto front obtained by assuming the stability of the first objective f0 is similar to that of the second objective f1. In Figure 3c, stability of f1 is preferred over f0. Figure 3d shows more stable results for f0 than f1 (s0 > s1). Figures 3e and 3f show the results obtained by MOBO-RS and the corresponding bounding boxes. The gradient color of the Pareto front points in Figures 3b-3d indicates their degree of compliance with the constraints.\n\nFigure 4 illustrates a special test case in which s0 > s1 and s2 > s1, yet no preferences are specified over f2 and f0, while minimising the Viennet function. The proposed complex preference-order constraint does not form a proper cone as elaborated in Theorem 1. However, s0 > s1 independently constructs a proper cone, likewise for s2 > s1. 
Figure 4a shows the results of processing these two independent constraints separately, merging their results and finding the Pareto front. Figure 4b implies more stable solutions for f0 compared to f1. Figure 4c shows that the Pareto front points comply with s2 > s1.\n\n[Figure 4 panels: (a) Obtained Pareto points; (b) Projection of Pareto points in f0 and f1 space; (c) Projection of Pareto points in f1 and f2 space]\n\nFigure 4: Finding Pareto front points with partial constraints as specified by s0 > s1 and s2 > s1. Figure 4a shows the 3D plot of the obtained Pareto front points satisfying preference-order constraints, with the color indicating the degree of compliance. Figure 4b illustrates the projection of Pareto-optimal points on the f0 × f1 sub-space, and Figure 4c shows the projection on the f1 × f2 sub-space.\n\nFigure 5: Average Pareto fronts obtained by the proposed method in comparison to other methods. This experiment defines s1 > s0, i.e. stability of run time is more important than the error. For MOBO-RS, [[0.02, 0], [0.03, 2]] is additional information used as a bounding box. The other methods do not incorporate preferences. The results are shown for 100 evaluations of the objectives (left) and 200 evaluations of the objectives (right).\n\n6.3 Finding a Fast and Accurate Neural Network\nNext, we train a neural network with two objectives of minimising both prediction error and prediction time, as per [9]. These are conflicting objectives because reducing the prediction error generally involves larger networks and consequently longer testing time. 
We use the MNIST dataset, and the\ntuning parameters include the number of hidden layers (x1 \u2208 [1, 3]), the number of hidden units per layer\n(x2 \u2208 [50, 300]), the learning rate (x3 \u2208 (0, 0.2]), the amount of dropout (x4 \u2208 [0.4, 0.8]), and the level\nof l1 (x5 \u2208 (0, 0.1]) and l2 (x6 \u2208 (0, 0.1]) regularization. For this problem we assume that the stability of\nf1 (time) during minimisation is more important than that of f0 (error). For the MOBO-RS method,\nwe selected the bounding box [[0.02, 0], [0.03, 2]] to represent accurate prior knowledge (see Figure\n5). The results were averaged over 5 independent runs. Figure 5 illustrates that one can simply ask\nfor solutions that are more stable with respect to the test time of a neural network (without any prior knowledge)\nwhile optimising the hyperparameters: all the solutions found with MOBO-PC have a test time in the range\n(0, 5). In addition, the proposed method appears to find a larger number of Pareto front solutions\nthan MOBO-RS.\n\n7 Conclusion\nIn this paper we proposed a novel multi-objective Bayesian optimisation algorithm with preferences\nover objectives. We define objective preferences in terms of stability and formulate a common\nframework to focus on the sections of the Pareto front where the preferred objectives are more stable, as\nrequired. We evaluate our method on both synthetic and real-world problems and show that the\nobtained Pareto fronts comply with the preference-order constraints.\n\nAcknowledgments\nThis research was partially funded by the Australian Government through the Australian Research Council\n(ARC).
Prof Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).\n\n[Figures 4 and 5 plots: Figure 4 axes f0, f1 and f2, color bar Weight (\u00d710^-3); Figure 5 axes f0 (error) and f1 (time), legend SUR, PESMO, PAREGO, SMSEGO, MOBO-PC and MOBO-RS at 100 and 200 evaluations.]\n\nReferences\n[1] Roberto Calandra, Nakul Gopalan, Andr\u00e9 Seyfarth, Jan Peters, and Marc Peter Deisenroth.\nBayesian gait optimization for bipedal locomotion. In International Conference on Learning\nand Intelligent Optimization, pages 274\u2013290. Springer, 2014.\n\n[2] Roberto Calandra, Andr\u00e9 Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian optimization\nfor learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence,\n76(1-2):5\u201323, 2016.\n\n[3] Kalyanmoy Deb. Multi-objective optimization. In Search methodologies, pages 273\u2013316.\nSpringer, 2005.\n\n[4] Kalyanmoy Deb. Multi-objective optimization. In Search methodologies, pages 403\u2013449.\nSpringer, 2014.\n\n[5] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and Tanaka Meyarivan. A fast elitist non-dominated\nsorting genetic algorithm for multi-objective optimization: Nsga-ii. In International\nconference on parallel problem solving from nature, pages 849\u2013858. Springer, 2000.\n\n[6] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers. Journal of\nthe Royal Statistical Society: Series B (Statistical Methodology), 68(3):411\u2013436, 2006.\n\n[7] Michael Emmerich and Jan-willem Klinkenberg. The computation of the expected improvement\nin dominated hypervolume of pareto front approximations.
Rapport technique, Leiden\nUniversity, 34, 2008.\n\n[8] Paul Feliot, Julien Bect, and Emmanuel Vazquez. A bayesian approach to constrained single- and\nmulti-objective optimization. Journal of Global Optimization, 67(1-2):97\u2013133, 2017.\n\n[9] Daniel Hern\u00e1ndez-Lobato, Jose Hernandez-Lobato, Amar Shah, and Ryan Adams. Predictive\nentropy search for multi-objective bayesian optimization. In International Conference on\nMachine Learning, pages 1492\u20131501, 2016.\n\n[10] Simon Huband, Phil Hingston, Lyndon While, and Luigi Barone. An evolution strategy with\nprobabilistic mutation for multi-objective optimisation. In Proceedings of the IEEE Congress\non Evolutionary Computation, volume 4, pages 2284\u20132291, 2003.\n\n[11] Iris Hupkens, Andr\u00e9 Deutz, Kaifeng Yang, and Michael Emmerich. Faster exact algorithms for\ncomputing expected hypervolume improvement. In International Conference on Evolutionary\nMulti-Criterion Optimization, pages 65\u201379. Springer, 2015.\n\n[12] Ilija Ilievski, Taimoor Akhtar, Jiashi Feng, and Christine Annette Shoemaker. Efficient hyperparameter\noptimization for deep learning algorithms using deterministic rbf surrogates. In\nThirty-First AAAI Conference on Artificial Intelligence, 2017.\n\n[13] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast\nbayesian optimization of machine learning hyperparameters on large datasets. arXiv preprint\narXiv:1605.07079, 2016.\n\n[14] Joshua Knowles. Parego: a hybrid algorithm with on-line landscape approximation for expensive\nmultiobjective optimization problems. IEEE Transactions on Evolutionary Computation,\n10(1):50\u201366, 2006.\n\n[15] Marco Laumanns and Jiri Ocenasek. Bayesian optimization algorithms for multi-objective\noptimization. In International Conference on Parallel Problem Solving from Nature, pages\n298\u2013307.
Springer, 2002.\n\n[16] Cheng Li, David Rub\u00edn de Celis Leal, Santu Rana, Sunil Gupta, Alessandra Sutti, Stewart\nGreenhill, Teo Slezak, Murray Height, and Svetha Venkatesh. Rapid bayesian optimisation for\nsynthesis of short polymer fiber materials. Scientific reports, 7(1):5683, 2017.\n\n[17] A. O\u2019Hagan. Some bayesian numerical analysis. Bayesian Statistics, 7:345\u2013363, 1992.\n\n[18] Biswajit Paria, Kirthevasan Kandasamy, and Barnab\u00e1s P\u00f3czos. A flexible multi-objective\nbayesian optimization approach using random scalarizations. CoRR, abs/1805.12168, 2018.\n\n[19] Victor Picheny. Multiobjective optimization using gaussian process emulators via stepwise\nuncertainty reduction. Statistics and Computing, 25(6):1265\u20131280, 2015.\n\n[20] Wolfgang Ponweiser, Tobias Wagner, Dirk Biermann, and Markus Vincze. Multiobjective\noptimization on a limited budget of evaluations using model-assisted s-metric selection. In\nInternational Conference on Parallel Problem Solving from Nature, pages 784\u2013794. Springer,\n2008.\n\n[21] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on\nmachine learning, pages 63\u201371. Springer, 2004.\n\n[22] Carl Edward Rasmussen. Gaussian processes to speed up hybrid monte carlo for expensive\nbayesian integrals. Bayesian statistics, 7:651\u2013659, 2008.\n\n[23] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine\nLearning. MIT Press, 2006.\n\n[24] James David Schaffer. Some experiments in machine learning using vector evaluated genetic\nalgorithms (artificial intelligence, optimization, adaptation, pattern recognition). 1984.\n\n[25] Alistair Shilton, Santu Rana, Sunil Kumar Gupta, and Svetha Venkatesh. A simple recursive algorithm\nfor calculating expected hypervolume improvement. In BayesOpt2017 NIPS Workshop\non Bayesian Optimisation, 2017.\n\n[26] Ofer M. Shir, Michael Emmerich, Thomas Back, and Marc J. J.
Vrakking. The application of\nevolutionary multi-criteria optimization to dynamic molecular alignment. In Proceedings of 2007\nIEEE Congress on Evolutionary Computation, 2007.\n\n[27] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine\nlearning algorithms. In Advances in neural information processing systems, pages 2951\u20132959,\n2012.\n\n[28] Kristof Van Moffaert, Madalina M Drugan, and Ann Now\u00e9. Scalarized multi-objective reinforcement\nlearning: Novel design techniques. In Adaptive Dynamic Programming and Reinforcement\nLearning (ADPRL), pages 191\u2013199, 2013.\n\n[29] Lyndon While, Philip Hingston, Luigi Barone, and Simon Huband. A faster algorithm for\ncalculating hypervolume. IEEE Transactions on Evolutionary Computation, 10(1):29\u201338, 2006.\n\n[30] Martin Zaefferer, Thomas Bartz-Beielstein, Boris Naujoks, Tobias Wagner, and Michael Emmerich.\nA case study on multi-criteria optimization of an event detection software under limited\nbudgets. In Proceedings of the 2013 International Conference on Evolutionary Multi-Criterion\nOptimization, pages 756\u2013770. Springer, 2013.\n\n[31] Eckart Zitzler. Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications.\nPhD thesis, Swiss Federal Institute of Technology Zurich, 1999.\n", "award": [], "sourceid": 6625, "authors": [{"given_name": "Majid", "family_name": "Abdolshah", "institution": "Deakin University"}, {"given_name": "Alistair", "family_name": "Shilton", "institution": "Deakin University"}, {"given_name": "Santu", "family_name": "Rana", "institution": "Deakin University"}, {"given_name": "Sunil", "family_name": "Gupta", "institution": "Deakin University"}, {"given_name": "Svetha", "family_name": "Venkatesh", "institution": "Deakin University"}]}