{"title": "Query Complexity of Derivative-Free Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2672, "page_last": 2680, "abstract": "Derivative Free Optimization (DFO) is attractive when the objective function's derivatives are not available and evaluations are costly.   Moreover, if the function evaluations are noisy, then approximating gradients by finite differences is difficult.  This paper gives quantitative lower bounds on the performance of DFO with noisy function evaluations, exposing a fundamental and unavoidable gap between optimization performance based on noisy evaluations versus noisy gradients. This challenges the conventional wisdom that the method of finite differences is comparable to a stochastic gradient.  However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions.  A distinctive feature of the algorithm is that it only uses Boolean-valued function comparisons, rather than evaluations.  This makes the algorithm useful in an even wider range of applications, including optimization based on paired comparisons from human subjects, for example.  Remarkably, we show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same.", "full_text": "Query Complexity of Derivative-Free Optimization\n\nKevin G. Jamieson\n\nUniversity of Wisconsin\nMadison, WI 53706, USA\nkgjamieson@wisc.edu\n\nRobert D. 
Nowak\n\nUniversity of Wisconsin\nMadison, WI 53706, USA\nnowak@engr.wisc.edu\n\nBenjamin Recht\n\nUniversity of Wisconsin\nMadison, WI 53706, USA\nbrecht@cs.wisc.edu\n\nAbstract\n\nThis paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects, for example. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same.\n\n1 Introduction\n\nOptimizing large-scale complex systems often requires the tuning of many parameters. With training data or simulations one can evaluate the relative merit, or incurred loss, of different parameter settings, but it may be unclear how each parameter influences the overall objective function. In such cases, derivatives of the objective function with respect to the parameters are unavailable. Thus, we have seen a resurgence of interest in Derivative Free Optimization (DFO) [1, 2, 3, 4, 5, 6, 7, 8]. When function evaluations are noiseless, DFO methods can achieve the same rates of convergence as noiseless gradient methods up to a small factor depending on a low-order polynomial of the dimension [9, 5, 10]. 
This leads one to wonder if the same equivalence can be extended to the case when function evaluations and gradients are noisy. Sadly, this paper proves otherwise. We show that when function evaluations are noisy, the optimization error of any DFO is $\Omega(\sqrt{1/T})$, where $T$ is the number of evaluations. This lower bound holds even for strongly convex functions. In contrast, noisy gradient methods exhibit $\Theta(1/T)$ error scaling for strongly convex functions [9, 11]. A consequence of our theory is that finite differencing cannot achieve the rates of gradient methods when the function evaluations are noisy.\n\nOn the positive side, we also present a new derivative-free algorithm that achieves this lower bound with near optimal dimension dependence. Moreover, the algorithm uses only boolean comparisons of function values, not actual function values. This makes the algorithm applicable to situations in which the optimization is only able to probably correctly decide if the value of one configuration is better than the value of another. This is especially interesting in optimization based on human subject feedback, where paired comparisons are often used instead of numerical scoring. The convergence rate of the new algorithm is optimal in terms of $T$ and near-optimal in terms of its dependence on the ambient dimension. Surprisingly, our lower bounds show that this new algorithm that uses only function comparisons achieves the same rate in terms of $T$ as any algorithm that has access to function evaluations.\n\n2 Problem formulation and background\n\nWe now formalize the notation and conventions for our analysis of DFO. A function $f$ is strongly convex with constant $\tau$ on a convex set $\mathcal{B} \subset \mathbb{R}^n$ if there exists a constant $\tau > 0$ such that\n\n$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\tau}{2}\|x - y\|^2$\n\nfor all $x, y \in \mathcal{B}$. 
The gradient of $f$, if it exists, denoted $\nabla f$, is Lipschitz with constant $L$ if $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for some $L > 0$. The class of strongly convex functions with Lipschitz gradients defined on a nonempty, convex set $\mathcal{B} \subset \mathbb{R}^n$ which take their minimum in $\mathcal{B}$ with parameters $\tau$ and $L$ is denoted by $\mathcal{F}_{\tau,L,\mathcal{B}}$.\n\nThe problem we consider is minimizing a function $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$. The function $f$ is not explicitly known. An optimization procedure may only query the function in one of the following two ways.\n\nFunction Evaluation Oracle: For any point $x \in \mathcal{B}$ an optimization procedure can observe\n\n$E_f(x) = f(x) + w$\n\nwhere $w \in \mathbb{R}$ is a random variable with $\mathbb{E}[w] = 0$ and $\mathbb{E}[w^2] = \sigma^2$.\n\nFunction Comparison Oracle: For any pair of points $x, y \in \mathcal{B}$ an optimization procedure can observe a binary random variable $C_f(x, y)$ satisfying\n\n$P\left(C_f(x, y) = \operatorname{sign}\{f(y) - f(x)\}\right) \ge \frac{1}{2} + \min\left\{\delta_0,\, \mu |f(y) - f(x)|^{\kappa-1}\right\}$   (1)\n\nfor some $0 < \delta_0 \le 1/2$, $\mu > 0$ and $\kappa \ge 1$. When $\kappa = 1$, without loss of generality assume $\mu \le \delta_0 \le 1/2$. Note $\kappa = 1$ implies that the comparison oracle is correct with a probability that is greater than $1/2$ and independent of $x, y$. If $\kappa > 1$, then the oracle's reliability decreases as the difference between $f(x)$ and $f(y)$ decreases.\n\nTo illustrate how the function comparison and function evaluation oracles relate to each other, suppose $C_f(x, y) = \operatorname{sign}\{E_f(y) - E_f(x)\}$ where $E_f(x)$ is a function evaluation oracle with additive noise $w$. If $w$ is Gaussian distributed with mean zero and variance $\sigma^2$, then $\kappa = 2$ and $\mu \ge (4\pi\sigma^2)^{-1/2} e^{-1/2}$ (see supplementary materials). In fact, this choice of $w$ corresponds to Thurstone's law of comparative judgment, which is a popular model for outcomes of pairwise comparisons from human subjects [12]. 
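To make this reduction concrete, here is a small simulation (ours, not from the paper) of a comparison oracle built from two independent Gaussian-noise function evaluations; the test function, query points, and noise level are illustrative. For a gap $\Delta = |f(y) - f(x)|$ the induced oracle is correct with probability $\Phi(\Delta/(\sigma\sqrt{2}))$, since the difference of the two noise terms is $\mathcal{N}(0, 2\sigma^2)$.

```python
import math
import random

def evaluation_oracle(f, x, sigma, rng):
    """Noisy function evaluation: f(x) plus zero-mean Gaussian noise."""
    return f(x) + rng.gauss(0.0, sigma)

def comparison_oracle(f, x, y, sigma, rng):
    """Comparison oracle induced by two independent noisy evaluations.
    Returns +1 if the oracle claims f(y) > f(x), else -1."""
    return 1 if evaluation_oracle(f, y, sigma, rng) > evaluation_oracle(f, x, sigma, rng) else -1

def empirical_correct_rate(f, x, y, sigma, trials=200_000, seed=0):
    rng = random.Random(seed)
    truth = 1 if f(y) > f(x) else -1
    hits = sum(comparison_oracle(f, x, y, sigma, rng) == truth for _ in range(trials))
    return hits / trials

f = lambda x: (x - 1.0) ** 2   # strongly convex in one dimension
sigma = 1.0
delta = abs(f(0.5) - f(0.0))   # gap between the two function values
# Analytic value: Phi(delta / (sigma * sqrt(2))), written via erf.
analytic = 0.5 * (1.0 + math.erf(delta / (sigma * math.sqrt(2) * math.sqrt(2))))
rate = empirical_correct_rate(f, 0.0, 0.5, sigma)
```

With these illustrative values the empirical rate lands near 0.70: strictly better than a fair coin, and degrading toward $1/2$ as the gap shrinks, exactly the behavior the $\kappa = 2$ model above describes.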
If $w$ is a \u201cspikier\u201d distribution, such as a two-sided Gamma distribution with shape parameter in the range $(0, 1]$, then all values of $\kappa \in (1, 2]$ can be realized (see supplementary materials).\n\nInterest in the function comparison oracle is motivated by certain popular derivative-free optimization procedures that use only comparisons of function evaluations (e.g. [7]) and by optimization problems involving human subjects making paired comparisons (for instance, getting fitted for prescription lenses or a hearing aid, where unknown parameters specific to each person are tuned with the familiar queries \u201cbetter or worse?\u201d). Pairwise comparisons have also been suggested as a novel way to tune web-search algorithms [13]. Pairwise comparison strategies have previously been analyzed in the finite setting, where the task is to identify the best alternative among a finite set of alternatives (sometimes referred to as the dueling-bandit problem) [13, 14]. The function comparison oracle presented in this work and its analysis are novel. The main contributions of this work are as follows: (i) lower bounds for the function evaluation oracle in the presence of measurement noise, (ii) lower bounds for the function comparison oracle in the presence of noise, and (iii) an algorithm for the function comparison oracle, which can also be applied to the function evaluation oracle setting, that nearly matches both the lower bounds of (i) and (ii).\n\nWe prove our lower bounds for strongly convex functions with Lipschitz gradients defined on a compact, convex set $\mathcal{B}$, and because these problems are a subset of those involving all convex functions (and have non-empty intersection with problems where $f$ is merely Lipschitz), the lower bound also applies to these larger classes. 
While there are known theoretical results for DFO in the noiseless setting [15, 5, 10], to the best of our knowledge we are the first to characterize lower bounds for DFO in the stochastic setting. Moreover, we believe we are the first to give an upper bound for stochastic DFO using a function comparison oracle (which also applies to the function evaluation oracle). However, there are algorithms with upper bounds on the rates of convergence for stochastic DFO with the function evaluation oracle [15, 16]. We discuss the relevant results in the next section following the lower bounds.\n\nWhile many open problems remain in stochastic DFO (see Section 6), rates of convergence with a stochastic gradient oracle are well known and were first lower bounded by Nemirovski and Yudin [15]. These classic results were recently tightened to show a dependence on the dimension of the problem [17], and then tightened again to show a better dependence on the noise [11], matching the upper bound achieved by stochastic gradient descent [9]. The aim of this work is to start filling in the knowledge gaps of stochastic DFO so that it is as well understood as the stochastic gradient oracle. Our bounds are based on simple techniques borrowed from the statistical learning literature that use natural functions and oracles in the same spirit of [11].\n\n3 Main results\n\nThe results below are presented with simplifying constants that encompass many factors to aid in exposition. Explicit constants are given in the proofs in Sections 4 and 5. Throughout, we denote the minimizer of $f$ as $x^*_f$. The expectation in the bounds is with respect to the noise in the oracle queries and (possible) optimization algorithm randomization.\n\n3.1 Query complexity of the function comparison oracle\n\nTheorem 1. For every $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ let $C_f$ be a function comparison oracle with parameters $(\kappa, \mu, \delta_0)$. 
Then for $n \ge 8$ and sufficiently large $T$\n\n$\sup_{f \in \mathcal{F}_{\tau,L,\mathcal{B}}}\ \inf_{\hat{x}_T}\ \mathbb{E}\left[f(\hat{x}_T) - f(x^*_f)\right] \ge \begin{cases} c_1 \exp\left(-c_2\, T/n\right) & \text{if } \kappa = 1 \\ c_3 \left(\frac{n}{T}\right)^{\frac{1}{2(\kappa-1)}} & \text{if } \kappa > 1 \end{cases}$\n\nwhere the infimum is over the collection of all possible estimators of $x^*_f$ using at most $T$ queries to a function comparison oracle, and the supremum is taken with respect to all problems in $\mathcal{F}_{\tau,L,\mathcal{B}}$ and function comparison oracles with parameters $(\kappa, \mu, \delta_0)$. The constants $c_1, c_2, c_3$ depend on the oracle and function class parameters, as well as the geometry of $\mathcal{B}$, but are independent of $T$ and $n$.\n\nFor upper bounds we propose a specific algorithm based on coordinate descent in Section 5 and prove the following theorem for the case of unconstrained optimization, that is, $\mathcal{B} = \mathbb{R}^n$.\n\nTheorem 2. For every $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ with $\mathcal{B} = \mathbb{R}^n$ let $C_f$ be a function comparison oracle with parameters $(\kappa, \mu, \delta_0)$. Then there exists a coordinate-descent algorithm that is adaptive to unknown $\kappa \ge 1$ and outputs an estimate $\hat{x}_T$ after $T$ function comparison queries such that, with probability at least $1 - \delta$,\n\n$\sup_{f \in \mathcal{F}_{\tau,L,\mathcal{B}}} \mathbb{E}\left[f(\hat{x}_T) - f(x^*_f)\right] \le \begin{cases} c_1 \exp\left\{-c_2 \sqrt{T/n}\right\} & \text{if } \kappa = 1 \\ c_3\, n \left(\frac{n}{T}\right)^{\frac{1}{2(\kappa-1)}} & \text{if } \kappa > 1 \end{cases}$\n\nwhere $c_1, c_2, c_3$ depend on the oracle and function class parameters as well as $T$, $n$, and $1/\delta$, but only poly-logarithmically.\n\n3.2 Query complexity of the function evaluation oracle\n\nTheorem 3. For every $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ let $E_f$ be a function evaluation oracle with variance $\sigma^2$. Then for $n \ge 8$ and sufficiently large $T$\n\n$\sup_{f \in \mathcal{F}_{\tau,L,\mathcal{B}}}\ \inf_{\hat{x}_T}\ \mathbb{E}\left[f(\hat{x}_T) - f(x^*_f)\right] \ge c \left(\frac{n\sigma^2}{T}\right)^{1/2}$\n\nwhere the infimum is taken with respect to the collection of all possible estimators of $x^*_f$ using just $T$ queries to a function evaluation oracle, and the supremum is taken with respect to all problems in $\mathcal{F}_{\tau,L,\mathcal{B}}$ and function evaluation oracles with variance $\sigma^2$. 
The constant $c$ depends on the oracle and function class parameters, as well as the geometry of $\mathcal{B}$, but is independent of $T$ and $n$.\n\nBecause a function evaluation oracle can always be turned into a function comparison oracle (see the discussion above), the algorithm and upper bound in Theorem 2 with $\kappa = 2$ apply to many typical function evaluation oracles (e.g. additive Gaussian noise), yielding an upper bound of $(n^3\sigma^2/T)^{1/2}$, ignoring constants and log factors. This matches the rate of convergence as a function of $T$ and $\sigma^2$, but has worse dependence on the dimension $n$.\n\nAlternatively, under a less restrictive setting, Nemirovski and Yudin proposed two algorithms for the class of convex, Lipschitz functions that obtain rates of $n^{1/2}/T^{1/4}$ and $p(n)/T^{1/2}$, respectively, where $p(n)$ was left as an unspecified polynomial of $n$ [15]. While focusing on stochastic DFO with bandit feedback, Agarwal et al. built on the ideas developed in [15] to obtain a result that, as they point out, implies a convergence rate of $n^{16}/T^{1/2}$ in the optimization setting considered here [16]. Whether or not these rates can be improved to those obtained under the more restrictive function classes above is an open question.\n\nA related but fundamentally different problem is online (or stochastic) convex optimization with multi-point feedback [18, 5, 19]. Essentially, this setting allows the algorithm to probe the value of the function $f$ plus noise at multiple locations, where the noise changes at each time step but each set of samples at a given time experiences the same noise. Because the noise model of that work is incompatible with the one considered here, no comparisons should be made between the two.\n\n4 Lower Bounds\n\nThe lower bounds in Theorems 1 and 3 are proved using a general minimax bound [20, Thm. 
2.5]. Our proofs are most related to the approach developed in [21] for active learning, which, like optimization, involves a Markovian sampling process. Roughly speaking, the lower bounds are established by considering a simple case of the optimization problem in which the global minimum is known a priori to belong to a finite set. Since the simple case is \u201ceasier\u201d than the original optimization, the minimum number of queries required for a desired level of accuracy in this case yields a lower bound for the original problem.\n\nThe following theorem is used to prove the bounds. In the terms of the theorem, $f$ is a function to be minimized and $P_f$ is the probability model governing the noise associated with queries when $f$ is the true function.\n\nTheorem 4. [20, Thm. 2.5] Consider a class of functions $\mathcal{F}$ and an associated family of probability measures $\{P_f\}_{f \in \mathcal{F}}$. Let $M \ge 2$ be an integer and $f_0, f_1, \ldots, f_M$ be functions in $\mathcal{F}$. Let $d(\cdot,\cdot) : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ be a semi-distance and assume that:\n\n1. $d(f_i, f_j) \ge 2s > 0$, for all $0 \le i < j \le M$,\n2. $\frac{1}{M}\sum_{j=1}^{M} KL(P_j\|P_0) \le a \log M$,\n\nwhere the Kullback-Leibler divergence $KL(P_i\|P_0) := \int \log\frac{dP_i}{dP_0}\, dP_i$ is assumed to be well-defined (i.e., $P_0$ is a dominating measure) and $0 < a < 1/8$. Then\n\n$\inf_{\hat f}\ \sup_{f \in \mathcal{F}}\ P\left(d(\hat f, f) \ge s\right) \ge \inf_{\hat f}\ \max_{f \in \{f_0,\ldots,f_M\}} P\left(d(\hat f, f) \ge s\right) \ge \frac{\sqrt{M}}{1+\sqrt{M}}\left(1 - 2a - 2\sqrt{\frac{a}{\log M}}\right) > 0,$\n\nwhere the infimum is taken over all possible estimators based on a sample from $P_f$.\n\nWe are concerned with the functions in the class $\mathcal{F} := \mathcal{F}_{\tau,L,\mathcal{B}}$. The volume of $\mathcal{B}$ will affect only constant factors in our bounds, so we will simply denote the class of functions by $\mathcal{F}$ and refer explicitly to $\mathcal{B}$ only when necessary. Let $x_f := \arg\min_x f(x)$, for all $f \in \mathcal{F}$. The semi-distance we use is $d(f, g) := \|x_f - x_g\|$, for all $f, g \in \mathcal{F}$. Note that each point in $\mathcal{B}$ can be specified by one of many $f \in \mathcal{F}$. 
So the problem of selecting an $f$ is equivalent to selecting a point $x \in \mathcal{B}$. Indeed, the semi-distance defines a collection of equivalence classes in $\mathcal{F}$ (i.e., all functions having a minimum at $x \in \mathcal{B}$ are equivalent). For every $f \in \mathcal{F}$ we have $\inf_{g \in \mathcal{F}} f(x_g) = \inf_{x \in \mathcal{B}} f(x)$, which is a useful identity to keep in mind.\n\nWe now construct the functions $f_0, f_1, \ldots, f_M$ that will be used for our proofs. Let $\Omega = \{-1, 1\}^n$, so that each $\omega \in \Omega$ is a vertex of the $n$-dimensional hypercube. Let $\mathcal{V} \subset \Omega$ with cardinality $|\mathcal{V}| \ge 2^{n/8}$ such that for all $\omega \ne \omega' \in \mathcal{V}$ we have $\rho(\omega, \omega') \ge n/8$, where $\rho(\cdot,\cdot)$ is the Hamming distance. It is known that such a set exists by the Varshamov-Gilbert bound [20, Lemma 2.9]. Denote the elements of $\mathcal{V}$ by $\omega_0, \omega_1, \ldots, \omega_M$. Next we state some elementary bounds on the functions that will be used in our analysis.\n\nLemma 1. For $\epsilon > 0$ define the set $\mathcal{B} \subset \mathbb{R}^n$ to be the $\ell_\infty$ ball of radius $\epsilon$ and define the functions on $\mathcal{B}$: $f_i(x) := \frac{\tau}{2}\|x - \epsilon\omega_i\|^2$, for $i = 0, \ldots, M$, $\omega_i \in \mathcal{V}$, and $x_i := \arg\min_x f_i(x) = \epsilon\omega_i$. Then for all $0 \le i < j \le M$ and $x \in \mathcal{B}$ the functions $f_i$ satisfy\n\n1. $f_i$ is strongly convex-$\tau$ with Lipschitz-$\tau$ gradients and $x_i \in \mathcal{B}$,\n2. $\|x_i - x_j\| \ge \epsilon\sqrt{n/2}$,\n3. $|f_i(x) - f_j(x)| \le 2\tau n\epsilon^2$.\n\nWe are now ready to prove Theorems 1 and 3. Each proof uses the functions $f_0, \ldots, f_M$ a bit differently, and since the noise model is also different in each case, the KL divergence is bounded differently in each proof. We use the fact that if $X$ and $Y$ are random variables distributed according to Bernoulli distributions $P_X$ and $P_Y$ with parameters $1/2 + \mu$ and $1/2 - \mu$, then $KL(P_X\|P_Y) \le 4\mu^2/(1/2 - \mu)$. Also, if $X \sim \mathcal{N}(\mu_X, \sigma^2) =: P_X$ and $Y \sim \mathcal{N}(\mu_Y, \sigma^2) =: P_Y$, then $KL(P_X\|P_Y) = \frac{1}{2\sigma^2}\|\mu_X - \mu_Y\|^2$.\n\n4.1 Proof of Theorem 1\n\nFirst we will obtain the bound for the case $\kappa > 1$. 
Let the comparison oracle satisfy\n\n$P\left(C_{f_i}(x, y) = \operatorname{sign}\{f_i(y) - f_i(x)\}\right) = \frac{1}{2} + \min\left\{\mu |f_i(y) - f_i(x)|^{\kappa-1},\, \delta_0\right\}.$\n\nIn words, $C_{f_i}(x, y)$ is correct with probability as large as the right-hand side above, which is monotonically increasing in $|f_i(y) - f_i(x)|$. Let $\{x_k, y_k\}_{k=1}^T$ be a sequence of $T$ pairs in $\mathcal{B}$ and let $\{C_{f_i}(x_k, y_k)\}_{k=1}^T$ be the corresponding sequence of noisy comparisons. We allow the sequence $\{x_k, y_k\}_{k=1}^T$ to be generated in any way subject to the Markovian assumption that $C_{f_i}(x_k, y_k)$ given $(x_k, y_k)$ is conditionally independent of $\{x_i, y_i\}_{i<k}$. For $i = 0, \ldots, M$ and $\ell = 1, \ldots, T$ let $P_{i,\ell}$ denote the joint probability distribution of $\{x_k, y_k, C_{f_i}(x_k, y_k)\}_{k=1}^{\ell}$, let $Q_{i,\ell}$ denote the conditional distribution of $C_{f_i}(x_\ell, y_\ell)$ given $(x_\ell, y_\ell)$, and let $S_\ell$ denote the conditional distribution of $(x_\ell, y_\ell)$ given $\{x_k, y_k, C_{f_i}(x_k, y_k)\}_{k=1}^{\ell-1}$. Note that $S_\ell$ is only a function of the underlying optimization algorithm and does not depend on $i$. Then\n\n$KL(P_{i,T}\|P_{j,T}) = \mathbb{E}_{P_{i,T}}\left[\log\frac{P_{i,T}}{P_{j,T}}\right] = \mathbb{E}_{P_{i,T}}\left[\log\frac{\prod_{\ell=1}^T Q_{i,\ell} S_\ell}{\prod_{\ell=1}^T Q_{j,\ell} S_\ell}\right] = \mathbb{E}_{P_{i,T}}\left[\log\frac{\prod_{\ell=1}^T Q_{i,\ell}}{\prod_{\ell=1}^T Q_{j,\ell}}\right]$\n$= \sum_{\ell=1}^T \mathbb{E}_{P_{i,\ell}}\left[\mathbb{E}_{P_{i,\ell}}\left[\log\frac{Q_{i,\ell}}{Q_{j,\ell}}\,\Big|\,\{x_k, y_k\}_{k=1}^{\ell}\right]\right] \le T \sup_{x_1, y_1 \in \mathcal{B}} \mathbb{E}_{P_{i,1}}\left[\mathbb{E}_{P_{i,1}}\left[\log\frac{Q_{i,1}}{Q_{j,1}}\,\Big|\, x_1, y_1\right]\right].$\n\nBy the third claim of Lemma 1, $|f_i(x) - f_j(x)| \le 2\tau n\epsilon^2$, and therefore the bound above is less than or equal to the KL divergence between the Bernoulli distributions with parameters $\frac{1}{2} \pm \mu(2\tau n\epsilon^2)^{\kappa-1}$, yielding the bound\n\n$KL(P_{i,T}\|P_{j,T}) \le \frac{4T\mu^2(2\tau n\epsilon^2)^{2(\kappa-1)}}{1/2 - \mu(2\tau n\epsilon^2)^{\kappa-1}} \le 16\,T\mu^2(2\tau n\epsilon^2)^{2(\kappa-1)}$\n\nprovided $\epsilon$ is sufficiently small. We also assume $\epsilon$ (or, equivalently, $\mathcal{B}$) is sufficiently small so that $\mu|f_i(x) - f_j(x)|^{\kappa-1} \le \delta_0$. We are now ready to apply Theorem 4. Recalling that $M \ge 2^{n/8}$, we want to choose $\epsilon$ such that\n\n$KL(P_{i,T}\|P_{j,T}) \le 16\,T\mu^2(2\tau n\epsilon^2)^{2(\kappa-1)} \le a\,\frac{n}{8}\log(2) \le a\log M$\n\nwith an $a$ small enough so that we can apply the theorem. By setting $a = 1/16$ and equating the two sides of the equation we have\n\n$\epsilon = \epsilon_T := \frac{1}{\tau^{1/2}\sqrt{2n}}\left(\frac{n\log(2)}{2048\,\mu^2 T}\right)^{\frac{1}{4(\kappa-1)}}$\n\n(note that this also implies a sequence of sets $\mathcal{B}_T$ by the definition of the functions in Lemma 1). Thus, the semi-distance satisfies\n\n$d(f_j, f_i) = \|x_j - x_i\| \ge \sqrt{n/2}\,\epsilon_T = \frac{1}{2\tau^{1/2}}\left(\frac{n\log(2)}{2048\,\mu^2 T}\right)^{\frac{1}{4(\kappa-1)}} =: 2s_T.$\n\nApplying Theorem 4 we have\n\n$\inf_{\hat f}\ \sup_{f \in \mathcal{F}}\ P\left(\|x_{\hat f} - x_f\| \ge s_T\right) \ge \inf_{\hat f}\ \max_{i \in \{0,\ldots,M\}} P\left(\|x_{\hat f} - x_i\| \ge s_T\right) \ge \frac{\sqrt{M}}{1+\sqrt{M}}\left(1 - 2a - 2\sqrt{\frac{a}{\log M}}\right) > 1/7,$\n\nwhere the final inequality holds since $M \ge 2$ and $a = 1/16$. Strong convexity implies that $f(x) - f(x_f) \ge \frac{\tau}{2}\|x - x_f\|^2$ for all $f \in \mathcal{F}$ and $x \in \mathcal{B}$. Therefore\n\n$\inf_{\hat f}\ \sup_{f \in \mathcal{F}}\ P\left(f(x_{\hat f}) - f(x_f) \ge \frac{\tau}{2}s_T^2\right) \ge \inf_{\hat f}\ \max_{i \in \{0,\ldots,M\}} P\left(f_i(x_{\hat f}) - f_i(x_i) \ge \frac{\tau}{2}s_T^2\right)$\n$\ge \inf_{\hat f}\ \max_{i \in \{0,\ldots,M\}} P\left(\frac{\tau}{2}\|x_{\hat f} - x_i\|^2 \ge \frac{\tau}{2}s_T^2\right) = \inf_{\hat f}\ \max_{i \in \{0,\ldots,M\}} P\left(\|x_{\hat f} - x_i\| \ge s_T\right) > 1/7.$\n\nFinally, applying Markov's inequality we have\n\n$\sup_{f \in \mathcal{F}}\ \inf_{\hat f}\ \mathbb{E}\left[f(x_{\hat f}) - f(x_f)\right] \ge \frac{1}{7}\left(\frac{1}{32}\right)\left(\frac{n\log(2)}{2048\,\mu^2 T}\right)^{\frac{1}{2(\kappa-1)}}.$\n\n4.2 Proof of Theorem 1 for $\kappa = 1$\n\nTo handle the case when $\kappa = 1$ we use functions of the same form, but the construction is slightly different. Let $\ell$ be a positive integer and let $M = \ell^n$. Let $\{\xi_i\}_{i=1}^M$ be a set of uniformly spaced points in $\mathcal{B}$, which we define to be the unit cube in $\mathbb{R}^n$, so that $\|\xi_i - \xi_j\| \ge \ell^{-1}$ for all $i \ne j$. Define $f_i(x) := \|x - \xi_i\|^2$, $i = 1, \ldots, M$. Let $s := \frac{1}{2\ell}$ so that $d(f_i, f_j) := \|x^*_i - x^*_j\| \ge 2s$. Because $\kappa = 1$, we have $P\left(C_{f_i}(x, y) = \operatorname{sign}\{f_i(y) - f_i(x)\}\right) \ge \frac{1}{2} + \mu$ for some $\mu > 0$, all $i \in \{1, \ldots, M\}$, and all $x, y \in \mathcal{B}$. We bound $KL(P_{i,T}\|P_{j,T})$ in exactly the same way as we bounded it in Section 4.1, except that now we have $C_{f_i}(x_k, y_k) \sim \text{Bernoulli}(\frac{1}{2} + \mu)$ and $C_{f_j}(x_k, y_k) \sim \text{Bernoulli}(\frac{1}{2} - \mu)$. It then follows that if we wish to apply the theorem, we want to choose $s$ so that\n\n$KL(P_{i,T}\|P_{j,T}) \le \frac{4T\mu^2}{1/2 - \mu} \le a\log M = a\,n\log\frac{1}{2s}$\n\nfor some $a < 1/8$. Using the same sequence of steps as in Section 4.1 we have\n\n$\sup_{f \in \mathcal{F}}\ \inf_{\hat f}\ \mathbb{E}\left[f(x_{\hat f}) - f(x_f)\right] \ge \frac{1}{7}\left(\frac{1}{2}\right)^2 \exp\left\{-\frac{128\,T\mu^2}{n(1/2 - \mu)}\right\}.$\n\n4.3 Proof of Theorem 3\n\nLet $f_i$ for all $i = 0, \ldots, M$ be the functions considered in Lemma 1. 
Recall that the evaluation oracle is defined to be $E_f(x) := f(x) + w$, where $w$ is a random variable (independent of all other random variables under consideration) with $\mathbb{E}[w] = 0$ and $\mathbb{E}[w^2] = \sigma^2 > 0$. Let $\{x_k\}_{k=1}^T$ be a sequence of points in $\mathcal{B} \subset \mathbb{R}^n$ and let $\{E_f(x_k)\}_{k=1}^T$ denote the corresponding sequence of noisy evaluations of $f \in \mathcal{F}$. For $\ell = 1, \ldots, T$ let $P_{i,\ell}$ denote the joint probability distribution of $\{x_k, E_{f_i}(x_k)\}_{k=1}^{\ell}$, let $Q_{i,\ell}$ denote the conditional distribution of $E_{f_i}(x_\ell)$ given $x_\ell$, and let $S_\ell$ denote the conditional distribution of $x_\ell$ given $\{x_k, E_f(x_k)\}_{k=1}^{\ell-1}$. $S_\ell$ is a function of the underlying optimization algorithm and does not depend on $i$. We can now bound the KL divergence between any two hypotheses as in Section 4.1:\n\n$KL(P_{i,T}\|P_{j,T}) \le T \sup_{x_1 \in \mathcal{B}} \mathbb{E}_{P_{i,1}}\left[\mathbb{E}_{P_{i,1}}\left[\log\frac{Q_{i,1}}{Q_{j,1}}\,\Big|\, x_1\right]\right].$\n\nTo compute a bound, let us assume that $w$ is Gaussian distributed. Then\n\n$KL(P_{i,T}\|P_{j,T}) \le T \sup_{z \in \mathcal{B}} KL\left(\mathcal{N}(f_i(z), \sigma^2)\,\|\,\mathcal{N}(f_j(z), \sigma^2)\right) = \frac{T}{2\sigma^2}\sup_{z \in \mathcal{B}} |f_i(z) - f_j(z)|^2 \le \frac{T(2\tau n\epsilon^2)^2}{2\sigma^2}$\n\nby the third claim of Lemma 1. We then repeat the same procedure as in Section 4.1 to attain\n\n$\sup_{f \in \mathcal{F}}\ \inf_{\hat f}\ \mathbb{E}\left[f(x_{\hat f}) - f(x_f)\right] \ge \frac{1}{7}\left(\frac{1}{32}\right)\left(\frac{n\sigma^2\log(2)}{64\,T}\right)^{1/2}.$\n\n5 Upper bounds\n\nThe algorithm that achieves the upper bound using a pairwise comparison oracle is a combination of standard techniques and methods from the convex optimization and statistical learning literature. The algorithm is explained in full detail in the supplementary materials and is summarized as follows. At each iteration the algorithm picks a coordinate uniformly at random from the $n$ possible dimensions and then performs an approximate line search. 
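The loop just summarized can be sketched in a few lines. The following is a minimal illustration of ours, not the paper's exact procedure: it assumes a noiseless comparison oracle, and the bracket-expansion and interval-shrinking rules below are one simple comparison-only way to realize the approximate line search; all function names and constants are illustrative.

```python
import random

def comparison(f, x, y):
    # Noiseless pairwise comparison oracle: +1 if f(y) > f(x), else -1.
    return 1 if f(y) > f(x) else -1

def line_search(f, x, d, step=1.0, tol=1e-6):
    """Approximately minimize g(a) = f(x + a*d) using only comparisons:
    first bracket a minimizer, then shrink the bracket geometrically."""
    g = lambda a: f([xi + a * di for xi, di in zip(x, d)])
    # Bracket: expand an endpoint while it is still at least as good as g(0).
    lo, hi = -step, step
    while comparison(g, 0.0, lo) < 0:   # g(lo) <= g(0): minimizer may lie further left
        lo *= 2
    while comparison(g, 0.0, hi) < 0:   # g(hi) <= g(0): minimizer may lie further right
        hi *= 2
    # Shrink: compare two interior points and discard the worse side.
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if comparison(g, m1, m2) > 0:   # g(m2) > g(m1): minimizer is left of m2
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def coordinate_descent(f, x0, iters=200, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        i = rng.randrange(len(x))                       # random coordinate
        d = [1.0 if j == i else 0.0 for j in range(len(x))]
        a = line_search(f, x, d)
        x = [xi + a * di for xi, di in zip(x, d)]       # x_{k+1} = x_k + a*d_k
    return x

quadratic = lambda x: sum((xi - i) ** 2 for i, xi in enumerate(x))  # minimum at (0, 1, 2)
x_hat = coordinate_descent(quadratic, [5.0, -3.0, 7.0])
```

For this strongly convex quadratic the iterate converges to the minimizer up to the line-search tolerance; the noisy-oracle version analyzed below replaces each single comparison with a repeated-query test.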
By exploiting the fact that the function is strongly convex with Lipschitz gradients, one guarantees using standard arguments that the approximate line search makes a sufficient decrease in the objective function value in expectation [22, Ch. 9.3]. If the pairwise comparison oracle made no errors, then the approximate line search could be accomplished by a binary-search-like scheme, essentially a golden section line-search algorithm [23]. However, when responses from the oracle are only probably correct, we make the line search robust to errors by repeating the same query until we can be confident about the true, uncorrupted direction of the pairwise comparison, using a standard procedure from the active learning literature [24] (a similar technique was also implemented for the bandit setting of derivative-free optimization [8]). Because the analysis of each component is either known or elementary, we only sketch the proof here and leave the details to the supplementary materials.\n\n5.1 Coordinate descent\n\nGiven a candidate solution $x_k$ after $k \ge 0$ iterations, the algorithm defines a search direction $d_k = e_i$ where $i$ is chosen uniformly at random from the possible $n$ dimensions and $e_i$ is a vector of all zeros except for a one in the $i$th coordinate. We note that while we only analyze the case where the search direction $d_k$ is a coordinate direction, an analysis with the same result can be obtained with $d_k$ chosen uniformly from the unit sphere. Given $d_k$, a line search is then performed to find an $\alpha_k \in \mathbb{R}$ such that $f(x_{k+1}) - f(x_k)$ is sufficiently small, where $x_{k+1} = x_k + \alpha_k d_k$. In fact, as we will see in the next section, for some input parameter $\eta > 0$, the line search is guaranteed to return an $\alpha_k$ such that $|\alpha_k - \alpha^*| \le \eta$ where $\alpha^* = \arg\min_{\alpha \in \mathbb{R}} f(x_k + \alpha d_k)$. 
Using the fact that the gradients of $f$ are Lipschitz ($L$) we have\n\n$f(x_k + \alpha_k d_k) - f(x_k + \alpha^* d_k) \le \frac{L}{2}\|(\alpha_k - \alpha^*)d_k\|^2 = \frac{L}{2}|\alpha_k - \alpha^*|^2 \le \frac{L}{2}\eta^2.$\n\nIf we define $\hat\alpha_k = -\frac{\langle\nabla f(x_k), d_k\rangle}{L}$ then we have\n\n$f(x_k + \alpha_k d_k) - f(x_k) \le f(x_k + \alpha^* d_k) - f(x_k) + \frac{L}{2}\eta^2 \le f(x_k + \hat\alpha_k d_k) - f(x_k) + \frac{L}{2}\eta^2 \le -\frac{\langle\nabla f(x_k), d_k\rangle^2}{2L} + \frac{L}{2}\eta^2$\n\nwhere the last inequality follows from applying the fact that the gradients are Lipschitz ($L$). Rearranging the bound and taking the expectation with respect to $d_k$ we get\n\n$\mathbb{E}\left[f(x_{k+1}) - f(x^*)\right] - \frac{L}{2}\eta^2 \le \mathbb{E}\left[f(x_k) - f(x^*)\right] - \frac{\mathbb{E}[\|\nabla f(x_k)\|^2]}{4nL} \le \left(1 - \frac{\tau}{4nL}\right)\mathbb{E}\left[f(x_k) - f(x^*)\right],$\n\nwhere the second inequality follows from the fact that $f$ is strongly convex ($\tau$). If we define $\rho_k := \mathbb{E}[f(x_k) - f(x^*)]$ then we equivalently have\n\n$\rho_{k+1} - \frac{2nL^2\eta^2}{\tau} \le \left(1 - \frac{\tau}{4nL}\right)\left(\rho_k - \frac{2nL^2\eta^2}{\tau}\right) \le \left(1 - \frac{\tau}{4nL}\right)^{k+1}\left(\rho_0 - \frac{2nL^2\eta^2}{\tau}\right),$\n\nwhich leads to the following result.\n\nTheorem 5. Let $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ with $\mathcal{B} = \mathbb{R}^n$. For any $\eta > 0$ assume the line search returns an $\alpha_k$ that is within $\eta$ of the optimal after at most $T_\ell(\eta)$ queries to the pairwise comparison oracle. If $x_K$ is an estimate of $x^* = \arg\min_x f(x)$ after requesting no more than $K$ pairwise comparisons, then\n\n$\sup_f\ \mathbb{E}\left[f(x_K) - f(x^*)\right] \le \frac{4nL^2\eta^2}{\tau} \quad\text{whenever}\quad K \ge \frac{4nL}{\tau}\log\left(\frac{f(x_0) - f(x^*)}{2nL^2\eta^2/\tau}\right)T_\ell(\eta),$\n\nwhere the expectation is with respect to the random choice of $d_k$ at each iteration.\n\nThis implies that if we wish $\sup_f \mathbb{E}[f(x_K) - f(x^*)] \le \epsilon$ it suffices to take $\eta = \sqrt{\frac{\epsilon\tau}{4nL^2}}$ so that at most $\frac{4nL}{\tau}\log\left(\frac{f(x_0) - f(x^*)}{\epsilon/2}\right)T_\ell\left(\sqrt{\frac{\epsilon\tau}{4nL^2}}\right)$ pairwise comparisons are requested.\n\n5.2 Line search\n\nThis section is concerned with minimizing a function $f(x_k + \alpha_k d_k)$ over $\alpha_k \in \mathbb{R}$. In particular, we wish to find an $\alpha_k \in \mathbb{R}$ such that $|\alpha_k - \alpha^*| \le \eta$ where $\alpha^* = \arg\min_{\alpha \in \mathbb{R}} f(x_k + \alpha d_k)$. First assume that the function comparison oracle makes no errors. The line search operates by maintaining a pair of boundary points $\alpha^+, \alpha^-$ such that if at some iterate we have $\alpha^* \in [\alpha^-, \alpha^+]$, then at the next iterate we are guaranteed that $\alpha^*$ is still contained inside the boundary points, but $|\alpha^+ - \alpha^-|$ is halved. An initial set of boundary points $\alpha^+ > 0$ and $\alpha^- < 0$ is found using simple binary search. Thus, regardless of how far away or close $\alpha^*$ is, we converge to it exponentially fast. Exploiting the fact that $f$ is strongly convex ($\tau$) with Lipschitz ($L$) gradients, we can bound how far away or close $\alpha^*$ is from our initial iterate.\n\nTheorem 6. Let $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ with $\mathcal{B} = \mathbb{R}^n$ and let $C_f$ be a function comparison oracle that makes no errors. Let $x \in \mathbb{R}^n$ be an initial position and let $d \in \mathbb{R}^n$ be a search direction with $\|d\| = 1$. 
If $\alpha_K$ is an estimate of $\alpha^* = \arg\min_\alpha f(x + \alpha d)$ that is output from the line search after requesting no more than $K$ pairwise comparisons, then for any $\eta > 0$\n\n$|\alpha_K - \alpha^*| \le \eta \quad\text{whenever}\quad K \ge 2\log_2\left(\frac{256\,L\left(f(x) - f(x + d\alpha^*)\right)}{\tau^2\eta^2}\right).$\n\n5.3 Making the line search robust to errors\n\nNow assume that the responses from the pairwise comparison oracle are only probably correct, in accordance with the model introduced above. Essentially, the robust procedure runs the line search as if the oracle made no errors, except that each time a comparison is needed, the oracle is repeatedly queried until we can be confident about the true direction of the comparison. This strategy applied to active learning is well known because of its simplicity and its ability to adapt to unknown noise conditions [24]. However, we mention that when used in this way, this sampling procedure is known to be sub-optimal, so in practice one may want to implement a more efficient approach like that of [21]. Nevertheless, we have the following lemma.\n\nLemma 2. [24] For any $x, y \in \mathcal{B}$ with $P\left(C_f(x, y) = \operatorname{sign}\{f(y) - f(x)\}\right) = p$, with probability at least $1 - \delta$ the coin-tossing algorithm of [24] correctly identifies the sign of $\mathbb{E}[C_f(x, y)]$ and requests no more than $\frac{\log(2/\delta)}{4|1/2 - p|^2}\log_2\left(\frac{\log(2/\delta)}{4|1/2 - p|^2}\right)$ pairwise comparisons.\n\nIt would be convenient if we could simply apply the result of Lemma 2 to our line search procedure. Unfortunately, there is no guarantee that $|f(y) - f(x)|$ is bounded below, so for the case when $\kappa > 1$ it would be impossible to lower bound $|1/2 - p|$ in the lemma. To account for this, we sample at multiple locations per iteration, as opposed to just two in the noiseless algorithm, to ensure that we can always lower bound $|1/2 - p|$. 
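The repeated-querying idea behind Lemma 2 can be sketched as follows. The stopping rule here is a simple union-bounded Hoeffding confidence sequence of our own choosing, which captures the spirit of the coin-tossing algorithm of [24] but is not necessarily its exact form; the oracle bias values are illustrative.

```python
import math
import random

def robust_sign(query, delta=0.01, max_queries=100_000):
    """Repeatedly query a +/-1 comparison oracle until a Hoeffding-style
    confidence interval around the empirical mean excludes zero, then
    return the identified sign of E[query] (0 if the budget runs out)."""
    total = 0
    for t in range(1, max_queries + 1):
        total += query()
        mean = total / t
        # Union-bounded Hoeffding radius, valid simultaneously for all t.
        radius = math.sqrt(2 * math.log(4 * t * t / delta) / t)
        if abs(mean) > radius:
            return 1 if mean > 0 else -1
    return 0

def biased_oracle(p, rng):
    # Answers +1 with probability p, so E[query] = 2p - 1.
    return lambda: 1 if rng.random() < p else -1

rng = random.Random(1)
sign_pos = robust_sign(biased_oracle(0.65, rng))  # true sign of E is +1
sign_neg = robust_sign(biased_oracle(0.20, rng))  # true sign of E is -1
```

As in Lemma 2, the number of queries before stopping grows as the bias $|1/2 - p|$ shrinks, mirroring the $1/|1/2 - p|^2$ dependence; adaptivity to the unknown bias is exactly what makes the scheme usable when $\kappa$ is unknown.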
Intuitively, strong convexity ensures that $f$ cannot be arbitrarily flat, so for any three equally spaced points $x, y, z$ on the line $d_k$, if $f(x)$ is equal to $f(y)$, then the absolute difference between $f(x)$ and $f(z)$ must be bounded away from zero. Applying this idea and union bounding over the total number of times one must call the coin-tossing algorithm, one finds that with probability at least $1 - \delta$, the total number of calls to the pairwise comparison oracle over the course of the whole algorithm does not exceed $\tilde{O}\left(\frac{nL}{\tau}\,\frac{n}{\epsilon^{2(\kappa-1)}} \log^2\left(\frac{f(x_0) - f(x^*)}{\epsilon}\right) \log(n/\delta)\right)$. By finding a $T > 0$ that satisfies this bound for any $\epsilon$, we see that this is equivalent to a rate of $O\left(n \log(n/\delta) \left(\frac{n}{T}\right)^{\frac{1}{2(\kappa-1)}}\right)$ for $\kappa > 1$ and $O\left(\exp\left\{-c\sqrt{\frac{T}{n \log(n/\delta)}}\right\}\right)$ for $\kappa = 1$, ignoring polylog factors.

References

[1] T. Eitrich and B. Lang. Efficient optimization of support vector machine learning parameters for unbalanced datasets. Journal of Computational and Applied Mathematics, 196(2):425-436, 2006.

[2] R. Oeuvray and M. Bierlaire. A new derivative-free algorithm for the medical image registration problem. International Journal of Modelling and Simulation, 27(2):115-124, 2007.

[3] A.R. Conn, K. Scheinberg, and L.N. Vicente. Introduction to Derivative-Free Optimization, volume 8. Society for Industrial and Applied Mathematics, 2009.

[4] Warren B. Powell and Ilya O. Ryzhov. Optimal Learning. John Wiley and Sons, 2012.

[5] Y. Nesterov. Random gradient-free minimization of convex functions. CORE Discussion Papers, 2011.

[6] N. Srinivas, A. Krause, S.M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

[7] R. Storn and K. Price.
Differential evolution: a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341-359, 1997.

[8] A. Agarwal, D.P. Foster, D. Hsu, S.M. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. arXiv preprint arXiv:1107.1744, 2011.

[9] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574, 2009.

[10] V. Protasov. Algorithms for approximate calculation of the minimum of a convex function from its values. Mathematical Notes, 59:69-74, 1996. doi:10.1007/BF02312467.

[11] M. Raginsky and A. Rakhlin. Information-based complexity, feedback, and dynamics in convex programming. IEEE Transactions on Information Theory, 2011.

[12] L.L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.

[13] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 2012.

[14] K.G. Jamieson and R.D. Nowak. Active ranking using pairwise comparisons. Neural Information Processing Systems (NIPS), 2011.

[15] A.S. Nemirovsky and D.B. Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.

[16] A. Agarwal, D.P. Foster, D. Hsu, S.M. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. arXiv preprint arXiv:1107.1744, 2011.

[17] A. Agarwal, P.L. Bartlett, P. Ravikumar, and M.J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 2010.

[18] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory (COLT), 2010.

[19] S. Ghadimi and G. Lan.
Stochastic first- and zeroth-order methods for nonconvex stochastic programming. 2012.

[20] A.B. Tsybakov. Introduction to Nonparametric Estimation. Springer Verlag, 2009.

[21] R.M. Castro and R.D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.

[22] S.P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[23] R.P. Brent. Algorithms for Minimization Without Derivatives. Dover Publications, 2002.

[24] M. Kääriäinen. Active learning in the non-realizable case. In Algorithmic Learning Theory, pages 63-77. Springer, 2006.