{"title": "Improving the Expected Improvement Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 5381, "page_last": 5391, "abstract": "The expected improvement (EI) algorithm is a popular strategy for information collection in optimization under uncertainty. The algorithm is widely known to be too greedy, but nevertheless enjoys wide use due to its simplicity and ability to handle uncertainty and noise in a coherent decision theoretic framework. To provide rigorous insight into EI, we study its properties in a simple setting of Bayesian optimization where the domain consists of a finite grid of points. This is the so-called best-arm identification problem, where the goal is to allocate measurement effort wisely to confidently identify the best arm using a small number of measurements. In this framework, one can show formally that EI is far from optimal. To overcome this shortcoming, we introduce a simple modification of the expected improvement algorithm. Surprisingly, this simple change results in an algorithm that is asymptotically optimal for Gaussian best-arm identification problems, and provably outperforms standard EI by an order of magnitude.", "full_text": "Improving the Expected Improvement Algorithm\n\nChao Qin\n\nColumbia Business School\n\nNew York, NY 10027\n\ncqin22@gsb.columbia.edu\n\nDiego Klabjan\n\nNorthwestern University\n\nEvanston, IL 60208\n\nd-klabjan@northwestern.edu\n\nDaniel Russo\n\nColumbia Business School\n\nNew York, NY 10027\n\ndjr2174@gsb.columbia.edu\n\nAbstract\n\nThe expected improvement (EI) algorithm is a popular strategy for information\ncollection in optimization under uncertainty. The algorithm is widely known to\nbe too greedy, but nevertheless enjoys wide use due to its simplicity and ability\nto handle uncertainty and noise in a coherent decision theoretic framework. 
To provide rigorous insight into EI, we study its properties in a simple setting of Bayesian optimization where the domain consists of a finite grid of points. This is the so-called best-arm identification problem, where the goal is to allocate measurement effort wisely to confidently identify the best arm using a small number of measurements. In this framework, one can show formally that EI is far from optimal. To overcome this shortcoming, we introduce a simple modification of the expected improvement algorithm. Surprisingly, this simple change results in an algorithm that is asymptotically optimal for Gaussian best-arm identification problems, and provably outperforms standard EI by an order of magnitude.

1 Introduction

Recently Bayesian optimization has received much attention in the machine learning community [21]. This literature studies the problem of maximizing an unknown black-box objective function by collecting noisy measurements of the function at carefully chosen sample points. First a prior belief over the objective function is prescribed, and then the statistical model is refined sequentially as data are observed. Expected improvement (EI) [13] is one of the most widely used Bayesian optimization algorithms. It is a greedy improvement-based heuristic that samples the point offering the greatest expected improvement over the current best sampled point. EI is simple and readily implementable, and it offers reasonable performance in practice.

Although EI is reasonably effective, it is too greedy, focusing nearly all sampling effort near the estimated optimum and gathering too little information about other regions in the domain. This phenomenon is most transparent in the simplest setting of Bayesian optimization, where the function's domain is a finite grid of points. This is the problem of best-arm identification (BAI) [1] in a multi-armed bandit.
The player sequentially selects arms to measure and observes noisy reward samples with the hope that a small number of measurements enables a confident identification of the best arm. Recently Ryzhov [20] studied the performance of EI in this setting. His work focuses on a link between EI and another algorithm known as the optimal computing budget allocation [3], but his analysis reveals that EI allocates a vanishing proportion of samples to suboptimal arms as the total number of samples grows. Any method with this property will be far from optimal in BAI problems [1].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we improve the EI algorithm dramatically through a simple modification. The resulting algorithm, which we call top-two expected improvement (TTEI), combines the top-two sampling idea of Russo [19] with a careful change to the improvement measure used by EI. We show that this simple variant of EI achieves strong asymptotic optimality properties in the BAI problem, and we benchmark the algorithm in simulation experiments.

Our main theoretical contribution is a complete characterization of the asymptotic proportion of samples TTEI allocates to each arm as a function of the true (unknown) arm means. These particular sampling proportions have been shown to be optimal from several perspectives [4, 12, 9, 19, 8], and this enables us to establish two different optimality results for TTEI. The first concerns the rate at which the algorithm gains confidence about the identity of the optimal arm as the total number of samples collected grows. Next we study the so-called fixed confidence setting, where the algorithm is able to stop at any point and return an estimate of the optimal arm.
We show that when applied with the stopping rule of Garivier and Kaufmann [8], TTEI essentially minimizes the expected number of samples required among all rules obeying a constraint on the probability of incorrect selection.

One undesirable feature of our algorithm is its dependence on a tuning parameter. Our theoretical results precisely quantify the impact of this parameter, and reveal a surprising degree of robustness to its value. It is also easy to design methods that adapt this parameter over time toward the optimal value, and we explore one such method in simulation. Still, removing this tuning parameter is an interesting direction for future research.

Further related literature. Despite the popularity of EI, its theoretical properties are not well studied. A notable exception is the work of Bull [2], who studies a global optimization problem and provides a convergence rate for EI's expected loss; however, that analysis assumes the observations are noiseless. Our work also relates to a large number of recent machine learning papers that try to characterize the sample complexity of the best-arm identification problem [5, 18, 1, 7, 14, 10, 11, 15-17]. Despite substantial progress, matching asymptotic upper and lower bounds remained elusive in this line of work. Building on older work in statistics [4, 12] and simulation optimization [9], recent work of Garivier and Kaufmann [8] and Russo [19] characterized the optimal sampling proportions. Two notions of asymptotic optimality are established: sample complexity in the fixed confidence setting and rate of posterior convergence.
Garivier and Kaufmann [8] developed two sampling rules designed to closely track the asymptotically optimal proportions and showed that, when combined with a stopping rule motivated by Chernoff [4], these sampling rules minimize the expected number of samples required to guarantee that a vanishing threshold on the probability of incorrect selection is satisfied. Russo [19] independently proposed three simple Bayesian algorithms, and proved that each attains the optimal rate of posterior convergence. The TTEI algorithm proposed in this paper is conceptually most similar to the top-two value sampling of Russo [19], but it is more computationally efficient.

1.1 Main Contributions

As discussed below, our work makes both theoretical and algorithmic contributions.

Theoretical: Our main theoretical contribution is Theorem 1, which establishes that TTEI, a simple modification to a popular Bayesian heuristic, converges to the known optimal asymptotic sampling proportions. It is worth emphasizing that, unlike recent results for other top-two sampling algorithms [19], this theorem establishes that the expected time to converge to the optimal proportions is finite, which we need in order to establish optimality in the fixed confidence setting. Proving this result required substantial technical innovations. Theorems 2 and 3 are additional theoretical contributions. These mirror results in [19] and [8], but we extract minimal conditions on sampling rules that are sufficient to guarantee the two notions of optimality studied in those papers.

Algorithmic: On the algorithmic side, we substantially improve a widely used algorithm. TTEI can be easily implemented by modifying existing EI code but, as shown in our experiments, can offer an order-of-magnitude improvement. A more subtle point involves the advantages of TTEI over algorithms that are designed to directly target convergence to the asymptotically optimal proportions.
In the experiments, we show that TTEI substantially outperforms an oracle sampling rule whose sampling proportions directly track the asymptotically optimal proportions. This phenomenon should be explored further in future work, but it suggests that by carefully reasoning about the value of information TTEI accounts for important factors that are washed out in asymptotic analysis. Finally, as discussed in the conclusion, although we focus on uncorrelated priors we believe our method can be easily extended to more complicated problems like that of best-arm identification in linear bandits [22].

2 Problem Formulation

Let A = {1, . . . , k} be the set of arms. At each time n ∈ N = {0, 1, 2, . . .}, an arm I_n ∈ A is measured, and an independent noisy reward Y_{n,I_n} is observed. The reward Y_{n,i} ∈ R of arm i at time n follows a normal distribution N(μ_i, σ²) with common known variance σ² but unknown mean μ_i. The objective is to allocate measurement effort wisely in order to confidently identify the arm with the highest mean using a small number of measurements. We assume that μ_1 > μ_2 > . . . > μ_k. Our analysis takes place in a frequentist setting, in which the true means (μ_1, . . . , μ_k) are fixed but unknown. The algorithms we study, however, are Bayesian in the sense that they begin with a prior over the arm means and update the belief to form a posterior distribution as evidence is gathered.

Prior and Posterior Distributions. The sampling rules studied in this paper begin with a normally distributed prior over the true mean of each arm i ∈ A, denoted by N(μ_{0,i}, σ²_{0,i}), and update this to form a posterior distribution as observations are gathered. By conjugacy, the posterior distribution after observing the sequence (I_0, Y_{0,I_0}, . . . , I_{n-1}, Y_{n-1,I_{n-1}}) is also a normal distribution, denoted by N(μ_{n,i}, σ²_{n,i}).
The posterior mean and variance can be calculated using the following recursive equations:

    μ_{n+1,i} = (σ_{n,i}^{-2} μ_{n,i} + σ^{-2} Y_{n,i}) / (σ_{n,i}^{-2} + σ^{-2})   if I_n = i,
    μ_{n+1,i} = μ_{n,i}                                                             if I_n ≠ i,

and

    σ²_{n+1,i} = 1 / (σ_{n,i}^{-2} + σ^{-2})   if I_n = i,
    σ²_{n+1,i} = σ²_{n,i}                      if I_n ≠ i.

We denote the posterior distribution over the vector of arm means by

    Π_n = N(μ_{n,1}, σ²_{n,1}) ⊗ N(μ_{n,2}, σ²_{n,2}) ⊗ ··· ⊗ N(μ_{n,k}, σ²_{n,k})

and let θ = (θ_1, . . . , θ_k). For example, with this notation

    E_{θ∼Π_n} [ Σ_{i∈A} θ_i ] = Σ_{i∈A} μ_{n,i}.

The posterior probability assigned to the event that arm i is optimal is

    α_{n,i} ≜ P_{θ∼Π_n} ( θ_i > max_{j≠i} θ_j ).                          (1)

To avoid confusion, we always use θ = (θ_1, . . . , θ_k) to denote a random vector of arm means drawn from the algorithm's posterior Π_n, and μ = (μ_1, . . . , μ_k) to denote the vector of true arm means.

Two notions of asymptotic optimality. Our first notion of optimality relates to the rate of posterior convergence. As the number of observations grows, one hopes that the posterior distribution definitively identifies the true best arm, in the sense that the posterior probability 1 - α_{n,1} assigned to the event that a different arm is optimal tends to zero. By sampling the arms intelligently, we hope this probability can be driven to zero as rapidly as possible. Following Russo [19], we aim to maximize the exponent governing the rate of decay,

    lim inf_{n→∞} - (1/n) log(1 - α_{n,1}),

among all sampling rules.

The second setting we consider is often called the "fixed confidence" setting. Here, the agent is allowed at any point to stop gathering samples and return an estimate of the identity of the optimal arm. In addition to a sampling rule, we require a stopping rule that selects a time τ at which to stop, and a decision rule that returns an estimate Î_τ of the optimal arm based on the first τ observations. We consider minimizing the average number of observations E[τ_δ] required by an algorithm (consisting of a sampling rule, a stopping rule and a decision rule) that guarantees a vanishing probability δ of incorrect identification, i.e., P(Î_{τ_δ} ≠ 1) ≤ δ. Following Garivier and Kaufmann [8], the number of samples required scales with log(1/δ), and so we aim to minimize

    lim sup_{δ→0} E[τ_δ] / log(1/δ)

among all algorithms with probability of error no more than δ. In this setting, we study the performance of sampling rules when combined with the stopping rule studied by Chernoff [4] and Garivier and Kaufmann [8].

3 Sampling Rules

In this section, we first introduce the expected improvement algorithm and point out its weakness. Then a simple variant of the expected improvement algorithm is proposed. Both algorithms make calculations using the function f(x) = xΦ(x) + φ(x), where Φ(·) and φ(·) are the CDF and PDF of the standard normal distribution. One can show that as x → ∞, log f(-x) ∼ -x²/2, and so f(-x) ≈ e^{-x²/2} for very large x. One can also show that f is an increasing function.

Expected Improvement. Expected improvement [13] is a simple improvement-based sampling rule. The EI algorithm favors the arm that offers the largest amount of improvement upon a target. The EI algorithm measures the arm I_n = arg max_{i∈A} v_{n,i}, where v_{n,i} is the EI value of arm i at time n.
Let I*_n = arg max_{i∈A} μ_{n,i} denote the arm with the largest posterior mean at time n. The EI value of arm i at time n is defined as

    v_{n,i} ≜ E_{θ∼Π_n} [ (θ_i - μ_{n,I*_n})⁺ ],

where x⁺ = max{x, 0}. The above expectation can be computed analytically as follows:

    v_{n,i} = (μ_{n,i} - μ_{n,I*_n}) Φ( (μ_{n,i} - μ_{n,I*_n}) / σ_{n,i} ) + σ_{n,i} φ( (μ_{n,i} - μ_{n,I*_n}) / σ_{n,i} )
            = σ_{n,i} f( (μ_{n,i} - μ_{n,I*_n}) / σ_{n,i} ).

The EI value v_{n,i} measures the potential of arm i to improve upon the largest posterior mean μ_{n,I*_n} at time n. Because f is an increasing function, v_{n,i} is increasing in both the posterior mean μ_{n,i} and the posterior standard deviation σ_{n,i}.

Top-Two Expected Improvement. The EI algorithm can have very poor performance for selecting the best arm. Once the posterior indicates a particular arm is the best with reasonably high probability, EI allocates nearly all future samples to this arm at the expense of measuring other arms. Recently Ryzhov [20] showed that EI asymptotically allocates only O(log n) samples to suboptimal arms. This is a severe shortcoming, as it means n must be extremely large before the algorithm has enough samples from suboptimal arms to reach a confident conclusion.

To improve the EI algorithm, we build on the top-two sampling idea in Russo [19]. The idea is to identify in each period the two "most promising" arms based on current observations, and randomize to choose which to sample. A tuning parameter β ∈ (0, 1) controls the probability assigned to the "top" arm.
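For concreteness, the conjugate posterior update and the closed-form EI value above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code; the function names are ours, and only the uncorrelated Gaussian model of Section 2 is assumed.

```python
import math


def f(x):
    """f(x) = x * Phi(x) + phi(x), with Phi/phi the standard normal CDF/PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return x * cdf + pdf


def posterior_update(mean, var, y, noise_var):
    """One conjugate Gaussian update of (mu_{n,i}, sigma^2_{n,i}) after observing y."""
    prec = 1.0 / var + 1.0 / noise_var
    new_mean = (mean / var + y / noise_var) / prec
    return new_mean, 1.0 / prec


def ei_values(post_means, post_sds):
    """EI values v_{n,i} = sigma_{n,i} * f((mu_{n,i} - mu_{n,I*}) / sigma_{n,i})."""
    best = max(post_means)
    return [sd * f((m - best) / sd) for m, sd in zip(post_means, post_sds)]
```

Note that for the current best arm the EI value reduces to σ_{n,i} φ(0), so it is never exactly zero; EI's greediness comes from the exponential decay of f at the other arms, not from the best arm's value dominating outright.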
A naive top-two variant of EI would identify the two arms with the largest EI values, and flip a β-weighted coin to decide which to measure. However, one can prove that this algorithm is not optimal for any choice of β. Instead, what we call the top-two expected improvement algorithm uses a novel modified EI criterion which more carefully accounts for the decision-maker's uncertainty when deciding which arm to sample.

For i, j ∈ A, define v_{n,i,j} ≜ E_{θ∼Π_n}[(θ_i - θ_j)⁺]. This measures the expected magnitude of improvement arm i offers over arm j, but unlike the typical EI criterion, this expectation integrates over the uncertain quality of both arms. This measure can be computed analytically as

    v_{n,i,j} = sqrt(σ²_{n,i} + σ²_{n,j}) f( (μ_{n,i} - μ_{n,j}) / sqrt(σ²_{n,i} + σ²_{n,j}) ).

TTEI depends on a tuning parameter β ∈ (0, 1), set to 1/2 by default. With probability β, TTEI measures the arm I⁽¹⁾_n by optimizing the EI criterion, and otherwise it measures an alternative I⁽²⁾_n that offers the largest expected improvement on the arm I⁽¹⁾_n. Formally, TTEI measures the arm

    I_n = I⁽¹⁾_n = arg max_{i∈A} v_{n,i}             with probability β,
    I_n = I⁽²⁾_n = arg max_{i∈A} v_{n,i,I⁽¹⁾_n}      with probability 1 - β.

Note that v_{n,i,i} = 0, which implies I⁽²⁾_n ≠ I⁽¹⁾_n. We note that TTEI with β = 1 is the standard EI algorithm. Compared to the EI algorithm, TTEI with β ∈ (0, 1) allocates much more measurement effort to suboptimal arms.
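The TTEI sampling step above can likewise be sketched in Python. Again this is an illustrative sketch with our own hypothetical names (`v_pair`, `ttei_choose`), assuming independent Gaussian posteriors per arm as in the uncorrelated model of Section 2.

```python
import math
import random


def f(x):
    """f(x) = x * Phi(x) + phi(x) for the standard normal."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return x * cdf + pdf


def v_pair(mu_i, mu_j, var_i, var_j):
    """Modified EI value v_{n,i,j} for i != j; integrates over both arms."""
    s = math.sqrt(var_i + var_j)
    return s * f((mu_i - mu_j) / s)


def ttei_choose(post_means, post_vars, beta=0.5, rng=random):
    """One TTEI sampling decision from the current posterior parameters."""
    k = len(post_means)
    best_mean = max(post_means)
    # Standard EI values relative to the largest posterior mean.
    ei = [math.sqrt(v) * f((m - best_mean) / math.sqrt(v))
          for m, v in zip(post_means, post_vars)]
    top = max(range(k), key=lambda i: ei[i])
    if rng.random() < beta:
        return top
    # Challenger: the arm offering the largest expected improvement over the
    # top arm; since v_{n,i,i} = 0, the challenger always differs from it.
    return max((i for i in range(k) if i != top),
               key=lambda i: v_pair(post_means[i], post_means[top],
                                    post_vars[i], post_vars[top]))
```

Passing a deterministic `rng` makes the coin flip reproducible, which is convenient for testing; in practice the default `random` module suffices.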
We will see that TTEI allocates a β proportion of samples to the best arm asymptotically, and that it uses the remaining 1 - β fraction of samples for gathering evidence against each suboptimal arm.

4 Convergence to Asymptotically Optimal Proportions

For all i ∈ A and n ∈ N, we define T_{n,i} ≜ Σ_{ℓ=0}^{n-1} 1{I_ℓ = i} to be the number of samples of arm i before time n. We will show that under TTEI with parameter β, lim_{n→∞} T_{n,1}/n = β. That is, the algorithm asymptotically allocates a β proportion of the samples to the true best arm. Dropping for the moment questions regarding the impact of this tuning parameter, let us consider the optimal asymptotic proportion of effort to allocate to each of the k - 1 remaining arms. It is known that the optimal proportions are given by the unique vector (w^β_2, . . . , w^β_k) satisfying Σ_{i=2}^k w^β_i = 1 - β and

    (μ_1 - μ_2)² / (1/β + 1/w^β_2) = . . . = (μ_1 - μ_k)² / (1/β + 1/w^β_k).        (2)

We set w^β_1 = β, so w^β = (w^β_1, . . . , w^β_k) encodes the sampling proportions of each arm.

To understand the source of equation (2), imagine that over the first n periods each arm i is sampled exactly w^β_i n times, and let μ̂_{n,i} ∼ N(μ_i, σ²/(w^β_i n)) denote the empirical mean of arm i. Then

    μ̂_{n,1} - μ̂_{n,i} ∼ N(μ_1 - μ_i, σ̃²_i)   where   σ̃²_i = (σ²/n) (1/β + 1/w^β_i).

The probability that μ̂_{n,1} - μ̂_{n,i} ≤ 0, leading to an incorrect estimate of which arm has the highest mean, is Φ((μ_i - μ_1)/σ̃_i), where Φ is the CDF of the standard normal distribution.
Equation (2) is equivalent to requiring that (μ_1 - μ_i)/σ̃_i is equal for all arms i, so that the probability of falsely declaring μ_i ≥ μ_1 is equal for all i ≠ 1. In a sense, these sampling frequencies equalize the evidence against each suboptimal arm. These proportions appeared first in the machine learning literature in [19, 8], but appeared much earlier in the statistics literature in [12], and separately in the simulation optimization literature in [9]. As we will see in the next section, convergence to this allocation is a necessary condition for both notions of optimality considered in this paper.

Our main theoretical contribution is the following theorem, which establishes that under TTEI the sampling proportions converge to the proportions w^β derived above. Therefore, while the sampling proportion of the optimal arm is controlled by the tuning parameter β, the remaining 1 - β fraction of measurement is optimally distributed among the remaining k - 1 arms. Such a result was established for other top-two sampling algorithms in [19]. The second notion of optimality requires not just convergence to w^β with probability 1, but also a sense in which the expected time until convergence is finite. The following theorem presents such a stronger result for TTEI. To make this precise, we introduce a time after which, for each arm, the empirical proportion allocated to it is accurate. Specifically, given β ∈ (0, 1) and ε > 0, we define

    M^ε_β ≜ inf { N ∈ N : max_{i∈A} |T_{n,i}/n - w^β_i| ≤ ε  ∀n ≥ N }.        (3)

It is clear that P(M^ε_β < ∞) = 1 for all ε > 0 if and only if T_{n,i}/n → w^β_i with probability 1 for each arm i ∈ A.
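Returning to equation (2) for a moment: fixing the common value of the ratios determines every w^β_i, and the total allocation is monotone in that value, so the proportions can be computed by a one-dimensional bisection. The sketch below (our own helper names, assuming the means are sorted with a unique best arm) also grid-searches β for the complexity measure Γ*_β defined in the next subsection.

```python
def optimal_proportions(means, beta=0.5, iters=200):
    """Solve equation (2) for (w_2, ..., w_k) summing to 1 - beta; means[0]
    must be the unique best arm. Returns the full vector w^beta."""
    gaps2 = [(means[0] - m) ** 2 for m in means[1:]]
    # Writing x for the common value of (mu_1 - mu_i)^2 / (1/beta + 1/w_i),
    # each w_i = 1 / (gap_i^2 / x - 1/beta); the total allocation increases
    # with x, so bisect on x in (0, beta * min gap^2).
    lo, hi = 0.0, beta * min(gaps2)
    for _ in range(iters):
        x = 0.5 * (lo + hi)
        if sum(1.0 / (g / x - 1.0 / beta) for g in gaps2) < 1.0 - beta:
            lo = x
        else:
            hi = x
    x = 0.5 * (lo + hi)
    return [beta] + [1.0 / (g / x - 1.0 / beta) for g in gaps2]


def gamma_star(means, beta, sigma2=1.0):
    """Gamma*_beta = (mu_1 - mu_i)^2 / (2 sigma^2 (1/beta + 1/w^beta_i))."""
    w = optimal_proportions(means, beta)
    gap2 = (means[0] - means[1]) ** 2
    return gap2 / (2.0 * sigma2 * (1.0 / beta + 1.0 / w[1]))


def best_beta(means, grid=99):
    """Grid search for beta* = argmax over beta of Gamma*_beta."""
    betas = [(j + 1) / (grid + 1) for j in range(grid)]
    return max(betas, key=lambda b: gamma_star(means, b))
```

The bisection works because each w_i is increasing in the common ratio value, and the value must stay below β times the smallest squared gap for all proportions to remain positive.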
To establish optimality in the "fixed confidence" setting, we need to prove in addition that E[M^ε_β] < ∞ for all ε > 0, which requires substantial new technical innovations.

Theorem 1. Under TTEI with parameter β ∈ (0, 1), E[M^ε_β] < ∞ for any ε > 0.

This result implies that under TTEI, P(M^ε_β < ∞) = 1 for all ε > 0, or equivalently

    lim_{n→∞} T_{n,i}/n = w^β_i   ∀i ∈ A.

4.1 Problem Complexity Measure

Given β ∈ (0, 1), define the problem complexity measure

    Γ*_β ≜ (μ_1 - μ_2)² / ( 2σ² (1/β + 1/w^β_2) ) = . . . = (μ_1 - μ_k)² / ( 2σ² (1/β + 1/w^β_k) ),

which is a function of the true arm means and variances. This will be the exponent governing the rate of posterior convergence, and it also characterizes the average number of samples in the fixed confidence setting. The optimal exponent comes from maximizing over β. Let us define Γ* = max_{β∈(0,1)} Γ*_β and β* = arg max_{β∈(0,1)} Γ*_β, and set

    w* = w^{β*} = (β*, w^{β*}_2, . . . , w^{β*}_k).

Russo [19] has proved that for β ∈ (0, 1),

    Γ*_β ≥ Γ* / max{ β*/β, (1 - β*)/(1 - β) },

and therefore Γ*_{1/2} ≥ Γ*/2. This demonstrates a surprising degree of robustness to β. In particular, Γ*_β is close to Γ* if β is adjusted to be close to β*, and the choice of β = 1/2 always yields a 2-approximation to Γ*.

5 Implied Optimality Results

This section establishes formal optimality guarantees for TTEI. Both results, in fact, hold for any algorithm satisfying the conclusions of Theorem 1, and are therefore of broader interest.

5.1 Optimal Rate of Posterior Convergence

We first provide upper and lower bounds on the exponent governing the rate of posterior convergence. The same result has been proved in Russo [19] for bounded correlated priors. We use different proof techniques to prove the following result for uncorrelated Gaussian priors.

This theorem shows that no algorithm can attain a rate of posterior convergence faster than e^{-Γ*n}, and that this rate is attained by any algorithm that, like TTEI with the optimal tuning parameter β*, has asymptotic sampling ratios (w*_1, . . . , w*_k). The second part implies that TTEI with parameter β attains the convergence rate e^{-nΓ*_β}, and that it is optimal among sampling rules that allocate a β fraction of samples to the optimal arm. Recall that, without loss of generality, we have assumed arm 1 is the arm with the true highest mean μ_1 = max_{i∈A} μ_i. We will study the posterior mass 1 - α_{n,1} assigned to the event that some other arm has the highest mean.

Theorem 2 (Posterior Convergence - Sufficient Condition for Optimality). The following properties hold with probability 1:

1. Under any sampling rule that satisfies T_{n,i}/n → w*_i for each i ∈ A,

    lim_{n→∞} - (1/n) log(1 - α_{n,1}) = Γ*.

Under any sampling rule,

    lim sup_{n→∞} - (1/n) log(1 - α_{n,1}) ≤ Γ*.

2. Let β ∈ (0, 1).
Under any sampling rule that satisfies T_{n,i}/n → w^β_i for each i ∈ A,

    lim_{n→∞} - (1/n) log(1 - α_{n,1}) = Γ*_β.

Under any sampling rule that satisfies T_{n,1}/n → β,

    lim sup_{n→∞} - (1/n) log(1 - α_{n,1}) ≤ Γ*_β.

This result reveals that when the tuning parameter β is set optimally to β*, TTEI attains the optimal rate of posterior convergence. Since Γ*_{1/2} ≥ Γ*/2, when β is set to the default value 1/2, the exponent governing the convergence rate of TTEI is at least half of the optimal one.

5.2 Optimal Average Sample Size

Chernoff's Stopping Rule. In the fixed confidence setting, besides an efficient sampling rule, a player also needs to design an intelligent stopping rule. This section introduces a stopping rule proposed by Chernoff [4] and studied recently by Garivier and Kaufmann [8]. This stopping rule makes use of the Generalized Likelihood Ratio statistic, which depends on the current maximum likelihood estimates of all unknown means. For each arm i ∈ A, the maximum likelihood estimate of its unknown mean μ_i at time n is its empirical mean μ̂_{n,i} = T_{n,i}^{-1} Σ_{ℓ=0}^{n-1} 1{I_ℓ = i} Y_{ℓ,I_ℓ}, where T_{n,i} = Σ_{ℓ=0}^{n-1} 1{I_ℓ = i}.
Next we define a weighted average of the empirical means of arms i, j ∈ A:

    μ̂_{n,i,j} ≜ ( T_{n,i} / (T_{n,i} + T_{n,j}) ) μ̂_{n,i} + ( T_{n,j} / (T_{n,i} + T_{n,j}) ) μ̂_{n,j}.

Then if μ̂_{n,i} ≥ μ̂_{n,j}, the Generalized Likelihood Ratio statistic Z_{n,i,j} has the following explicit expression:

    Z_{n,i,j} ≜ T_{n,i} d(μ̂_{n,i}, μ̂_{n,i,j}) + T_{n,j} d(μ̂_{n,j}, μ̂_{n,i,j}),

where d(x, y) = (x - y)²/(2σ²) is the Kullback-Leibler (KL) divergence between the Gaussian distributions N(x, σ²) and N(y, σ²). Similarly, if μ̂_{n,i} < μ̂_{n,j}, then Z_{n,i,j} = -Z_{n,j,i} ≤ 0, where Z_{n,j,i} is well defined as above. If either arm has never been sampled before, these quantities are not well defined and we take the convention that Z_{n,i,j} = Z_{n,j,i} = 0. Given a target confidence δ ∈ (0, 1), to ensure that one arm is better than the others with probability at least 1 - δ, we use the stopping time

    τ_δ ≜ inf { n ∈ N : Z_n ≜ max_{i∈A} min_{j∈A\{i}} Z_{n,i,j} > γ_{n,δ} },

where γ_{n,δ} > 0 is an appropriate threshold. By definition, min_{j∈A\{i}} Z_{n,i,j} is nonnegative if and only if μ̂_{n,i} ≥ μ̂_{n,j} for all j ∈ A \ {i}. Hence, whenever Î*_n ≜ arg max_{i∈A} μ̂_{n,i} is unique, Z_n = min_{j∈A\{Î*_n}} Z_{n,Î*_n,j}.

Next we introduce the exploration rate for normal bandit models that ensures identification of the best arm with probability at least 1 - δ. We use the following result given in Garivier and Kaufmann [8].

Proposition 1 (Garivier and Kaufmann [8], Proposition 12). Let δ ∈ (0, 1) and α > 1.
There exists a constant C = C(α, k) such that under any sampling rule, using Chernoff's stopping rule with the threshold γ^α_{n,δ} = log(Cn^α/δ) guarantees

    P( τ_δ < ∞, arg max_{i∈A} μ̂_{τ_δ,i} ≠ 1 ) ≤ δ.

Sample Complexity. Garivier and Kaufmann [8] recently provided a general lower bound on the number of samples required in the fixed confidence setting. In particular, they show that for any normal bandit model, under any sampling rule and stopping time τ_δ that guarantees a probability of error no more than δ,

    lim inf_{δ→0} E[τ_δ] / log(1/δ) ≥ 1/Γ*.

Recall that M^ε_β, defined in (3), is the first time after which the empirical proportions are within ε of their asymptotic limits. The next result provides a condition in terms of M^ε_β that is sufficient to guarantee optimality in the fixed confidence setting.

Theorem 3 (Fixed Confidence - Sufficient Condition for Optimality). Let δ, β ∈ (0, 1) and α > 1. Under any sampling rule which, if applied with no stopping rule, satisfies E[M^ε_β] < ∞ for all ε > 0, using Chernoff's stopping rule with the threshold γ^α_{n,δ} = log(Cn^α/δ) (where C = C(α, k)) guarantees

    lim sup_{δ→0} E[τ_δ] / log(1/δ) ≤ 1/Γ*_β.

When β = β*, the general lower bound on sample complexity of 1/Γ* is essentially matched.
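Before turning to experiments, the stopping statistic of Section 5.2 can be sketched directly from its definition. This is an illustrative sketch with our own names; in particular, the constant `C` below is an arbitrary placeholder, not the problem-dependent constant C(α, k) of Proposition 1.

```python
import math


def glr_statistic(n_i, mean_i, n_j, mean_j, sigma2=1.0):
    """Z_{n,i,j} for Gaussian rewards with known common variance sigma2.
    Returns 0 by convention if either arm has never been sampled."""
    if n_i == 0 or n_j == 0:
        return 0.0
    pooled = (n_i * mean_i + n_j * mean_j) / (n_i + n_j)
    d = lambda x, y: (x - y) ** 2 / (2.0 * sigma2)  # Gaussian KL divergence
    z = n_i * d(mean_i, pooled) + n_j * d(mean_j, pooled)
    # Z_{n,i,j} = -Z_{n,j,i} <= 0 when arm i trails arm j empirically.
    return z if mean_i >= mean_j else -z


def should_stop(counts, means, n, delta, alpha=2.0, C=10.0, sigma2=1.0):
    """Chernoff check: Z_n = max_i min_{j != i} Z_{n,i,j} > log(C n^alpha / delta)."""
    k = len(means)
    z_n = max(
        min(glr_statistic(counts[i], means[i], counts[j], means[j], sigma2)
            for j in range(k) if j != i)
        for i in range(k)
    )
    return z_n > math.log(C * n ** alpha / delta)
```

Because the pooled mean is the same for (i, j) and (j, i), the sign flip in `glr_statistic` reproduces the identity Z_{n,i,j} = -Z_{n,j,i} without recomputing the quadratic form.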
In addition, when β is set to the default value 1/2, the sample complexity of TTEI combined with Chernoff's stopping rule is at most twice the optimal sample complexity, since 1/Γ*_{1/2} ≤ 2/Γ*.

6 Numerical Experiments

To test the empirical performance of TTEI, we conduct several numerical experiments. The first experiment compares the performance of TTEI with β = 1/2 and EI. The second experiment compares the performance of different versions of TTEI, top-two Thompson sampling (TTTS) [19], knowledge gradient (KG) [6], and oracle algorithms that know the optimal proportions a priori. Each algorithm plays each arm i = 1, . . . , k exactly once at the beginning, and then prescribes a prior N(Y_{i,i}, σ²) for the unknown arm mean μ_i, where Y_{i,i} is the observation from N(μ_i, σ²). In both experiments, we fix the common known variance σ² = 1 and the number of arms k = 5. We consider three instances [μ_1, . . . , μ_5] = [5, 4, 1, 1, 1], [5, 4, 3, 2, 1] and [2, 0.8, 0.6, 0.4, 0.2]. The optimal parameter β* equals 0.48, 0.45 and 0.35, respectively.

Recall that α_{n,i}, defined in (1), denotes the posterior probability that arm i is optimal. Tables 1 and 2 show the average number of measurements required for the largest posterior probability assigned to some arm being the best to reach a given confidence level c, i.e., max_i α_{n,i} ≥ c. In a Bayesian setting, the probability of correct selection under this rule is exactly c. The results in Table 1 are averaged over 100 trials.
We see that TTEI with β = 1/2 outperforms standard EI by an order of magnitude.

Table 1: Average number of measurements required to reach the confidence level c = 0.95

                         TTEI-1/2        EI
  [5, 4, 1, 1, 1]           14.60    238.50
  [5, 4, 3, 2, 1]           16.72    384.73
  [2, .8, .6, .4, .2]       24.39   1525.42

The second experiment compares the performance of different versions of TTEI, TTTS, KG, a random sampling oracle (RSO) and a tracking oracle (TO). The random sampling oracle draws a random arm in each round from the distribution w* encoding the asymptotically optimal proportions. The tracking oracle tracks the optimal proportions at each round: specifically, it samples the arm with the largest ratio of its optimal proportion to its empirical proportion. Two tracking algorithms proposed by Garivier and Kaufmann [8] are similar to this tracking oracle. TTEI with adaptive β (aTTEI) works as follows: it starts with β = 1/2 and updates β = β̂* every 10 rounds, where β̂* is the maximizer of equation (2) based on plug-in estimates of the unknown arm-means. Table 2 shows the average number of measurements required for the largest posterior probability, max_i α_{n,i}, to reach the confidence level c = 0.9999. The results in Table 2 are averaged over 200 trials. We see that TTEI with adaptive β and TTEI with β* perform better than all other algorithms.
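The stopping criterion max_i α_{n,i} ≥ c used in both experiments requires the posterior probability that each arm is optimal. Under independent normal posteriors this has no simple closed form for more than two arms, but a plain Monte Carlo estimate suffices; the sketch below (function name and draw count are our own) illustrates the computation.

```python
import numpy as np

def optimal_arm_probs(mu, var, n_draws=100_000, seed=0):
    """Estimate alpha_i = P(theta_i = max_j theta_j) by Monte Carlo,
    with independent posteriors theta_i ~ N(mu[i], var[i])."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu, np.sqrt(var), size=(n_draws, len(mu)))
    wins = np.bincount(np.argmax(theta, axis=1), minlength=len(mu))
    return wins / n_draws

# Stop sampling once optimal_arm_probs(mu, var).max() >= c.
```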
We note that TTEI with adaptive β substantially outperforms the tracking oracle.

Table 2: Average number of measurements required to reach the confidence level c = 0.9999

                         TTEI-1/2    aTTEI   TTEI-β*   TTTS-β*      RSO       TO       KG
  [5, 4, 1, 1, 1]           61.97    61.59     61.98     62.86    97.04    77.76    75.55
  [5, 4, 3, 2, 1]           66.56    65.54     65.55     66.53   103.43    88.02    81.49
  [2, .8, .6, .4, .2]       76.21    71.62     72.94     73.02   101.97    96.90    86.98

In addition to the Bayesian stopping rule tested above, we have run some experiments with the Chernoff stopping rule discussed in Section 5.2. Asymptotic analysis shows these two rules are similar when the confidence level c is very high. However, the Chernoff stopping rule appears to be too conservative in practice; it typically yields a probability of correct selection much larger than the specified confidence level c, at the expense of using more samples. Since our focus here is on allocation rules, we report results under the Bayesian stopping rule, which appears to offer a more fundamental comparison than one based on an ad hoc choice of tuning parameters. Developing improved stopping rules is an important area for future research.

7 Conclusion and Extensions to Correlated Arms

We conclude by noting that while this paper thoroughly studies TTEI in the case of uncorrelated priors, we believe the algorithm is also ideally suited to problems with complex correlated priors and large sets of arms. In fact, the modified information measure v_{n,i,j} was designed with an eye toward dealing with correlation in a sophisticated way.
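Under a correlated normal posterior N(µ, Σ), the difference θ_i − θ_j is itself normal, so v_{n,i,j} = E[(θ_i − θ_j)^+] can be evaluated in closed form with f(x) = xΦ(x) + φ(x). A minimal sketch of that computation:

```python
import numpy as np
from scipy.stats import norm

def f(x):
    # f(x) = x * Phi(x) + phi(x), so E[(theta)^+] = s * f(m / s)
    # for a scalar theta ~ N(m, s^2)
    return x * norm.cdf(x) + norm.pdf(x)

def v(mu, Sigma, i, j):
    """E[(theta_i - theta_j)^+] for theta ~ N(mu, Sigma): the difference
    is normal with mean mu[i] - mu[j] and variance
    Sigma[i,i] + Sigma[j,j] - 2*Sigma[i,j]."""
    s = np.sqrt(Sigma[i, i] + Sigma[j, j] - 2.0 * Sigma[i, j])
    return s * f((mu[i] - mu[j]) / s)
```

Raising Σ_ij while holding the variances fixed shrinks v: arms highly correlated with the current leader offer little potential improvement over it.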
In the case of a correlated normal distribution N(µ, Σ), one has

    v_{n,i,j} = E_{θ∼N(µ,Σ)}[(θ_i − θ_j)^+] = sqrt(Σ_ii + Σ_jj − 2Σ_ij) · f( (µ_{n,i} − µ_{n,j}) / sqrt(Σ_ii + Σ_jj − 2Σ_ij) ).

This closed form accommodates efficient computation. Here the term Σ_ij accounts for the correlation or similarity between arms i and j. Therefore v_{n,i,I^(1)_n} is large for arms i that offer large potential improvement over I^(1)_n, i.e. those that (1) have large posterior mean, (2) have large posterior variance, and (3) are not highly correlated with arm I^(1)_n. As I^(1)_n concentrates near the estimated optimum, we expect the third factor will force the algorithm to experiment in promising regions of the domain that are "far" away from the currently estimated optimum and are under-explored under standard EI.

References

[1] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 41–53, 2010.

[2] Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr12.html#Bull11.

[3] Chun-Hung Chen, Jianwu Lin, Enver Yücesan, and Stephen E Chick. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems, 10(3):251–270, 2000.

[4] Herman Chernoff. Sequential design of experiments. Ann. Math. Statist., 30(3):755–770, 1959. doi: 10.1214/aoms/1177706205. URL http://dx.doi.org/10.1214/aoms/1177706205.

[5] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes.
In Fifteenth Annual Conference on Computational Learning Theory (COLT), pages 255–270, 2002.

[6] Peter I Frazier, Warren B Powell, and Savas Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.

[7] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 3212–3220. Curran Associates, Inc., 2012.

[8] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 998–1027, 2016.

[9] P. Glynn and S. Juneja. A large deviations perspective on ordinal optimization. In Proceedings of the 2004 Winter Simulation Conference, volume 1. IEEE, 2004.

[10] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 423–439, Barcelona, Spain, 13–15 Jun 2014. PMLR. URL http://proceedings.mlr.press/v35/jamieson14.html.

[11] Kevin G. Jamieson and Robert D. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 48th Annual Conference on Information Sciences and Systems, CISS 2014, Princeton, NJ, USA, March 19-21, 2014, pages 1–6, 2014.

[12] C. Jennison, I. M. Johnstone, and B. W. Turnbull. Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means.
Statistical decision theory and related topics III, 2:55–86, 1982.

[13] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998. doi: 10.1023/A:1008306431147. URL http://dx.doi.org/10.1023/A:1008306431147.

[14] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1238–1246, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/karnin13.html.

[15] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Shai Shalev-Shwartz and Ingo Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 228–251, Princeton, NJ, USA, 12–14 Jun 2013. PMLR. URL http://proceedings.mlr.press/v30/Kaufmann13.html.

[16] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of A/B testing. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 461–481, Barcelona, Spain, 13–15 Jun 2014. PMLR. URL http://proceedings.mlr.press/v35/kaufmann14.html.

[17] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1–42, 2016. URL http://jmlr.org/papers/v17/kaufman16a.html.

[18] Shie Mannor and John N. Tsitsiklis.
The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, 2004.

[19] Daniel Russo. Simple Bayesian algorithms for best arm identification. In 29th Annual Conference on Learning Theory, pages 1417–1418, 2016.

[20] Ilya O. Ryzhov. On the convergence rates of expected improvement methods. Operations Research, 64(6):1515–1528, 2016. doi: 10.1287/opre.2016.1494. URL http://dx.doi.org/10.1287/opre.2016.1494.

[21] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016. doi: 10.1109/JPROC.2015.2494218. URL http://dx.doi.org/10.1109/JPROC.2015.2494218.

[22] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014.