{"title": "Adaptive Concentration Inequalities for Sequential Decision Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1343, "page_last": 1351, "abstract": "A key challenge in sequential decision problems is to determine how many samples are needed for an agent to make reliable decisions with good probabilistic guarantees.   We introduce Hoeffding-like concentration inequalities that hold for a random, adaptively chosen number of samples. Our inequalities are tight under natural assumptions and can greatly simplify the analysis of common sequential decision problems. In particular, we apply them to sequential hypothesis testing, best arm identification, and sorting. The resulting algorithms rival or exceed the state of the art both theoretically and empirically.", "full_text": "Adaptive Concentration Inequalities for Sequential Decision Problems\n\nShengjia Zhao\nTsinghua University\nzhaosj12@stanford.edu\n\nEnze Zhou\nTsinghua University\nzhouez_thu_12@126.com\n\nAshish Sabharwal\nAllen Institute for AI\nAshishS@allenai.org\n\nStefano Ermon\nStanford University\nermon@cs.stanford.edu\n\nAbstract\n\nA key challenge in sequential decision problems is to determine how many samples are needed for an agent to make reliable decisions with good probabilistic guarantees. We introduce Hoeffding-like concentration inequalities that hold for a random, adaptively chosen number of samples. Our inequalities are tight under natural assumptions and can greatly simplify the analysis of common sequential decision problems. In particular, we apply them to sequential hypothesis testing, best arm identification, and sorting. The resulting algorithms rival or exceed the state of the art both theoretically and empirically.\n\n1 Introduction\n\nMany problems in artificial intelligence (AI) and machine learning (ML) involve designing agents that interact with stochastic environments. 
The environment is typically modeled with a collection of random variables. A common assumption is that the agent acquires information by observing samples from these random variables. A key problem is to determine the number of samples that are required for the agent to make sound inferences and decisions based on the data it has collected.\n\nMany abstract problems fit into this general framework, including sequential hypothesis testing, e.g., testing for positiveness of the mean [18, 6], analysis of streaming data [19], best arm identification for multi-arm bandits (MAB) [1, 5, 13], etc. These problems involve the design of a sequential algorithm that needs to decide, at each step, either to acquire a new sample, or to terminate and output a conclusion, e.g., decide whether the mean of a random variable is positive or not. The challenge is that obtaining too many samples will result in inefficient algorithms, while taking too few might lead to the wrong decision.\n\nConcentration inequalities such as Hoeffding\u2019s inequality [11], Chernoff bound, and Azuma\u2019s inequality [7, 5] are among the main analytic tools. These inequalities are used to bound the probability of a large discrepancy between sample and population means, for a fixed number of samples n. An agent can control its risk by making decisions based on conclusions that hold with high confidence, due to the unlikely occurrence of large deviations. However, these inequalities only hold for a fixed, constant number of samples that is decided a priori. On the other hand, we often want to design agents that make decisions adaptively based on the data they collect. That is, we would like the number of samples itself to be a random variable. Traditional concentration inequalities, however, often do not hold when the number of samples is stochastic. 
Existing analysis requires ad-hoc strategies to bypass this issue, such as union bounding the risk over time [18, 17, 13]. These approaches can lead to suboptimal algorithms.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe introduce Hoeffding-like concentration inequalities that hold for a random, adaptively chosen number of samples. Interestingly, we can achieve our goal with a small double logarithmic overhead with respect to the number of samples required for standard Hoeffding inequalities. We also show that our bounds cannot be improved under some natural restrictions. Even though related inequalities have been proposed before [15, 2, 3], we show that ours are significantly tighter, and come with a complete analysis of the fundamental limits involved. Our inequalities are directly applicable to a number of sequential decision problems. In particular, we use them to design and analyze new algorithms for sequential hypothesis testing, best arm identification, and sorting. Our algorithms rival or outperform state-of-the-art techniques both theoretically and empirically.\n\n2 Adaptive Inequalities and Their Properties\n\nWe begin with some definitions and notation:\n\nDefinition 1. [20] Let X be a zero mean random variable. For any d > 0, we say X is d-subgaussian if \u2200r \u2208 R,\n\n$$E[e^{rX}] \\le e^{d^2 r^2/2}$$\n\nNote that a random variable can be subgaussian only if it has zero mean [20]. However, with some abuse of notation, we say that any random variable X is subgaussian if X \u2212 E[X] is subgaussian. Many important types of distributions are subgaussian. For example, by Hoeffding\u2019s Lemma [11], a distribution bounded in an interval of width 2d is d-subgaussian and a Gaussian random variable N(0, \u03c3^2) is \u03c3-subgaussian. 
Henceforth, we shall assume that the distributions are 1/2-subgaussian. Any d-subgaussian random variable can be scaled by 1/(2d) to be 1/2-subgaussian.\n\nDefinition 2 (Problem setup). Let X be a zero mean 1/2-subgaussian random variable. {X1, X2, . . .} are i.i.d. random samples of X. Let $S_n = \\sum_{i=1}^{n} X_i$ be a random walk. J is a stopping time with respect to {X1, X2, . . .}. We let J take a special value \u221e where Pr[J = \u221e] = 1 \u2212 lim_{n\u2192\u221e} Pr[J \u2264 n]. We also let f : N \u2192 R+ be a function that will serve as a boundary for the random walk.\n\nWe note that because it is possible for J to be infinity, to simplify notation, what we really mean by Pr[EJ ], where EJ is some event, is Pr[{J < \u221e} \u2229 EJ ]. We can often simplify notation and use Pr[EJ ] without confusion.\n\n2.1 Standard vs. Adaptive Concentration Inequalities\n\nThere is a very large class of well known inequalities that bound the probability of large deviations by confidence that increases exponentially w.r.t. bound tightness. An example is the Hoeffding inequality [12] which states, using the definitions mentioned above,\n\n$$\\Pr[S_n \\ge \\sqrt{bn}] \\le e^{-2b} \\qquad (1)$$\n\nOther examples include Azuma\u2019s inequality, Chernoff bound [7], and Bernstein inequalities [21]. However, these inequalities apply if n is a constant chosen in advance, or independent of the underlying process, but are generally untrue when n is a stopping time J that, being a random variable, depends on the process. In fact we shall later show in Theorem 3 that we can construct a stopping time J such that\n\n$$\\Pr[S_J \\ge \\sqrt{bJ}] = 1 \\qquad (2)$$\n\nfor any b > 0, even when we put strong restrictions on J.\n\nComparing Eqs. (1) and (2), one clearly sees how Chernoff and Hoeffding bounds are applicable only to algorithms whose decision to continue to sample or terminate is fixed a priori. 
This is a severe limitation for stochastic algorithms that have uncertain stopping conditions that may depend on the underlying process. We call a bound that holds for all possible stopping rules J an adaptive bound.\n\n2.2 Equivalence Principle\n\nWe start with the observation that finding a probabilistic bound on the position of the random walk SJ that holds for any stopping time J is equivalent to finding a deterministic boundary f (n) that the walk is unlikely to ever cross. Formally,\n\nProposition 1. For any \u03b4 > 0,\n\n$$\\Pr[S_J \\ge f(J)] \\le \\delta \\qquad (3)$$\n\nfor any stopping time J if and only if\n\n$$\\Pr[\\{\\exists n, S_n \\ge f(n)\\}] \\le \\delta \\qquad (4)$$\n\nIntuitively, for any f (n) we can choose an adversarial stopping rule that terminates the process as soon as the random walk crosses the boundary f (n). We can therefore achieve (3) for all stopping times J only if we guarantee that the random walk is unlikely to ever cross f (n), as in Eq. (4).\n\n2.3 Related Inequalities\n\nThe problem of studying the supremum of a random walk has a long history. The seminal work of Kolmogorov and Khinchin [4] characterized the limiting behavior of a zero mean random walk with unit variance:\n\n$$\\limsup_{n \\to \\infty} \\frac{S_n}{\\sqrt{2n \\log\\log n}} = 1 \\quad \\text{a.s.}$$\n\nThis law is called the Law of Iterated Logarithms (LIL), and sheds light on the limiting behavior of a random walk. In our framework, this implies\n\n$$\\lim_{m \\to \\infty} \\Pr\\left[\\exists n > m : S_n \\ge \\sqrt{2an \\log\\log n}\\right] = \\begin{cases} 1 & \\text{if } a < 1 \\\\ 0 & \\text{if } a > 1 \\end{cases}$$\n\nThis theorem provides a very strong result on the asymptotic behavior of the walk. However, in most ML and statistical applications, we are also interested in the finite-time behavior, which we study.\n\nThe problem of analyzing the finite-time properties of a random walk has been considered before in the ML literature. 
It is well known, and can be easily proven using Hoeffding\u2019s inequality union bounded over all possible times, that a trivial bound\n\n$$f(n) = \\sqrt{n \\log(2n^2/\\delta)/2} \\qquad (5)$$\n\nholds in the sense of Pr[\u2203n, Sn \u2265 f (n)] \u2264 \u03b4. This is true because by union bound and Hoeffding inequality [12]\n\n$$\\Pr[\\exists n, S_n \\ge f(n)] \\le \\sum_{n=1}^{\\infty} \\Pr[S_n \\ge f(n)] \\le \\sum_{n=1}^{\\infty} e^{-\\log(2n^2/\\delta)} \\le \\delta \\sum_{n=1}^{\\infty} \\frac{1}{2n^2} \\le \\delta$$\n\nRecently, inspired by the Law of Iterated Logarithms, Jamieson et al. [15], Jamieson and Nowak [13] and Balsubramani [2] proposed a boundary f (n) that scales asymptotically as $\\Theta(\\sqrt{n \\log\\log n})$ such that the \u201ccrossing event\u201d {\u2203n, Sn \u2265 f (n)} is guaranteed to occur with a low probability. They refer to this as finite time LIL inequality. These bounds, however, have significant room for improvement. Furthermore, [2] holds asymptotically, i.e., only w.r.t. the event {\u2203n > N, Sn \u2265 f (n)} for a sufficiently large (but finite) N, rather than across all time steps. In the following sections, we develop general bounds that improve upon these methods.\n\n3 New Adaptive Hoeffding-like Bounds\n\nOur first main result is an alternative to finite time LIL that is both tighter and simpler:\n\nTheorem 1 (Adaptive Hoeffding Inequality). Let Xi be zero mean 1/2-subgaussian random variables and let $\\{S_n = \\sum_{i=1}^{n} X_i, n \\ge 1\\}$ be a random walk. Let f : N \u2192 R+. Then,\n\n1. If $\\lim_{n \\to \\infty} \\frac{f(n)}{\\sqrt{(1/2) n \\log\\log n}} < 1$, there exists a distribution for X such that\n\n$$\\Pr[\\{\\exists n, S_n \\ge f(n)\\}] = 1$$\n\n2. If $f(n) = \\sqrt{an \\log(\\log_c n + 1) + bn}$, c > 1, a > c/2, b > 0, and \u03b6 is the Riemann-\u03b6 function, then\n\n$$\\Pr[\\{\\exists n, S_n \\ge f(n)\\}] \\le \\zeta(2a/c) e^{-2b/c} \\qquad (6)$$\n\n
We also remark that in practice the values of a and c do not significantly affect the quality of the bound. We recommend fixing a = 0.6 and c = 1.1 and will use this configuration in all subsequent experiments. The parameter b is the main factor controlling the confidence we have in the bound (6), i.e., the risk. The value of b is chosen so that the bound holds with probability at least 1 \u2212 \u03b4, where \u03b4 is a user specified parameter.\n\nBased on Proposition 1, and fixing a and c as above, we get a readily applicable corollary:\n\nCorollary 1. Let J be any random variable taking value in N. If\n\n$$f(n) = \\sqrt{0.6 n \\log(\\log_{1.1} n + 1) + bn}$$\n\nthen\n\n$$\\Pr[S_J \\ge f(J)] \\le 12 e^{-1.8b}$$\n\nThe bound we achieve is very similar in form to Hoeffding inequality (1), with an extra O(log log n) slack to achieve robustness to stochastic, adaptively chosen stopping times. We shall refer to this inequality as the Adaptive Hoeffding (AH) inequality.\n\nInformally, part 1 of Theorem 1 implies that if we choose a boundary f (n) that is convergent w.r.t. $\\sqrt{n \\log\\log n}$ and would like to bound the probability of the threshold-crossing event, $\\sqrt{(1/2) n \\log\\log n}$ is the asymptotically smallest f (n) we can have; anything asymptotically smaller will be crossed with probability 1. Furthermore, part 2 implies that as long as a > 1/2, we can choose a sufficiently large b so that threshold crossing has an arbitrarily small probability. 
Combined, we thus have that for any \u03ba > 0, the minimum f (call it f\u2217) needed to ensure an arbitrarily small threshold-crossing probability can be bounded asymptotically as follows:\n\n$$\\sqrt{1/2} \\sqrt{n \\log\\log n} \\le f^*(n) \\le (\\sqrt{1/2} + \\kappa) \\sqrt{n \\log\\log n} \\qquad (7)$$\n\nThis fact is illustrated in Figure 1, where we plot the bound f (n) from Corollary 1 with 12e^{\u22121.8b} = \u03b4 = 0.05 (AH, green). The corresponding Hoeffding bound (red) that would have held (with the same confidence, had n been a constant) is plotted as well. We also show draws from an unbiased random walk (blue). Out of the 1000 draws we sampled, approximately 25% of them cross the Hoeffding bound (red) before time 10^5, while none of them cross the adaptive bound (green), demonstrating the necessity of the extra $\\sqrt{\\log\\log n}$ factor even in practice.\n\nFigure 1: Illustration of Theorem 1 part 2. Each blue line represents a sampled walk. Although the probability of reaching higher than the Hoeffding bound (red) at a given time is small, the threshold is crossed almost surely. The new bound (green) remains unlikely to be crossed.\n\nWe also compare our bound with the trivial bound (5), LIL bound in Lemma 1 of [15] and Theorem 2 of [2]. The graph in Figure 2 shows the relative performance of the three bounds across different values of n and risk \u03b4. The LIL bound of [15] is plotted with parameter \u03b5 = 0.01 as recommended. We also experimented with other values of \u03b5, obtaining qualitatively similar results. It can be seen that our bound is significantly tighter (by roughly a factor of 1.5) across all values of n and \u03b4 that we evaluated.\n\nFigure 2: Comparison of Adaptive Hoeffding (AH) and LIL [15], LIL2 [2] and Trivial bound. A threshold function f (n) is computed and plotted according to the four bounds, so that crossing occurs with bounded probability \u03b4 (risk). The two plots correspond to different risk levels (0.01 and 0.1).\n\n
3.1 More General, Non-Smooth Boundaries\n\nIf we relax the requirement that f (n) must be smooth, or, formally, remove the condition that\n\n$$\\lim_{n \\to \\infty} \\frac{f(n)}{\\sqrt{n \\log\\log n}}$$\n\nmust exist or go to \u221e, then we might be able to obtain tighter bounds.\n\nFor example, many algorithms such as median elimination [9] or the exponential gap algorithm [17, 6] make (sampling) decisions \u201cin batch\u201d, and therefore can only stop at certain pre-defined times. The intuition is that if more samples are collected between decisions, the failure probability can be easier to control. This is equivalent to restricting the stopping time J to take values in a set $\\mathcal{N} \\subset \\mathbb{N}$. Equivalently we can also think of using a boundary function $f_{\\mathcal{N}}(n)$ defined as follows:\n\n$$f_{\\mathcal{N}}(n) = \\begin{cases} f(n) & n \\in \\mathcal{N} \\\\ +\\infty & \\text{otherwise} \\end{cases} \\qquad (8)$$\n\nVery often the set $\\mathcal{N}$ is taken to be the following set:\n\nDefinition 3 (Exponentially Sparse Stopping Time). We denote by $\\mathcal{N}_c$, c > 1, the set $\\mathcal{N}_c = \\{\\lceil c^n \\rceil : n \\in \\mathbb{N}\\}$.\n\nMethods based on exponentially sparse stopping times often achieve asymptotically optimal performance on a range of sequential decision making problems [9, 18, 17]. Here we construct an alternative to Theorem 1 based on exponentially sparse stopping times. We obtain a bound that is asymptotically equivalent, but has better constants and is often more effective in practice.\n\nTheorem 2 (Exponentially Sparse Adaptive Hoeffding Inequality). Let {Sn, n \u2265 1} be a random walk with 1/2-subgaussian increments. 
If\n\n$$f(n) = \\sqrt{an \\log(\\log_c n + 1) + bn}$$\n\nand c > 1, a > 1/2, b > 0, we have\n\n$$\\Pr[\\{\\exists n \\in \\mathcal{N}_c, S_n \\ge f(n)\\}] \\le \\zeta(2a) e^{-2b}$$\n\nWe call this inequality the exponentially sparse adaptive Hoeffding (ESAH) inequality. Compared to (6), the main improvement is the lack of the constant c in the RHS. In all subsequent experiments we fix a = 0.55 and c = 1.05.\n\nFinally, we provide limits for any boundary, including those obtained by a batch-sampling strategy.\n\nTheorem 3. Let {Sn, n \u2265 1} be a zero mean random walk with 1/2-subgaussian increments. Let f : N \u2192 R+. Then\n\n1. If there exists a constant C \u2265 0 such that $\\liminf_{n \\to \\infty} \\frac{f(n)}{\\sqrt{n}} < C$, then\n\n$$\\Pr[\\{\\exists n, S_n \\ge f(n)\\}] = 1$$\n\n2. If $\\lim_{n \\to \\infty} \\frac{f(n)}{\\sqrt{n}} = +\\infty$, then for any \u03b4 > 0 there exists an infinite set $\\mathcal{N} \\subset \\mathbb{N}$ such that\n\n$$\\Pr[\\{\\exists n \\in \\mathcal{N}, S_n \\ge f(n)\\}] < \\delta$$\n\nInformally, part 1 states that if a threshold f (n) drops an infinite number of times below an asymptotic bound of $\\Theta(\\sqrt{n})$, then the threshold will be crossed with probability 1. This rules out Hoeffding-like bounds. If f (n) grows asymptotically faster than $\\sqrt{n}$, then one can \u201csparsify\u201d f (n) so that it will be crossed with an arbitrarily small probability. In particular, a boundary with the form in Equation (8) can be constructed to bound the threshold-crossing probability below any \u03b4 (part 2 of the Theorem).\n\n4 Applications to ML and Statistics\n\nWe now apply our adaptive bound results to design new algorithms for various classic problems in ML and statistics. Our bounds can be used to analyze algorithms for many natural sequential problems, leading to a unified framework for such analysis. The resulting algorithms are asymptotically optimal or near optimal, and outperform competing algorithms in practice. 
We provide two applications in the following subsections and leave another to the appendix.\n\n4.1 Sequential Testing for Positiveness of Mean\n\nOur first example is sequential testing for the positiveness of the mean of a bounded random variable. In this problem, there is a 1/2-subgaussian random variable X with (unknown) mean \u00b5 \u2260 0. At each step, an agent can either request a sample from X, or terminate and declare whether or not E[X] > 0. The goal is to bound the agent\u2019s error probability by some user specified value \u03b4.\n\nThis problem is well studied [10, 18, 6]. In particular Karp and Kleinberg [18] show in Lemma 3.2 (\u201csecond simulation lemma\u201d) that this problem can be solved with an $O(\\log(1/\\delta) \\log\\log(1/\\mu)/\\mu^2)$ algorithm with confidence 1 \u2212 \u03b4. They also prove a lower bound of $\\Omega(\\log\\log(1/\\mu)/\\mu^2)$. Recently, Chen and Li [6] referred to this problem as the SIGN-\u03be problem and provided similar results.\n\nWe propose an algorithm that achieves the optimal asymptotic complexity and performs very well in practice, outperforming competing algorithms by a wide margin (because of better asymptotic constants). The algorithm is captured by the following definition.\n\nDefinition 4 (Boundary Sequential Test). Let f : N \u2192 R+ be a function. We draw i.i.d. samples X1, X2, . . . from the target distribution X. Let $S_n = \\sum_{i=1}^{n} X_i$ be the corresponding partial sum.\n\n1. If Sn \u2265 f (n), terminate and declare E[X] > 0;\n2. if Sn \u2264 \u2212f (n), terminate and declare E[X] < 0;\n3. otherwise increment n and obtain a new sample.\n\nWe call such a test a symmetric boundary test. In the following theorem we analyze its performance.\n\n
Theorem 4. Let \u03b4 > 0 and X be any 1/2-subgaussian distribution with non-zero mean. Let\n\n$$f(n) = \\sqrt{an \\log(\\log_c n + 1) + bn}$$\n\nwhere c > 1, a > c/2, and b = (c/2) log \u03b6(2a/c) + (c/2) log(1/\u03b4). Then, with probability at least 1 \u2212 \u03b4, a symmetric boundary test terminates with the correct sign for E[X], and with probability 1 \u2212 \u03b4, for any \u03b5 > 0 it terminates in at most\n\n$$(2c + \\varepsilon) \\left( \\frac{\\log(1/\\delta) \\log\\log(1/\\mu)}{\\mu^2} \\right)$$\n\nsamples asymptotically w.r.t. 1/\u00b5 and 1/\u03b4.\n\n4.1.1 Experiments\n\nTo evaluate the empirical performance of our algorithm (AH-RW), we run an experiment where X is a Bernoulli distribution over {\u22121/2, 1/2}, for various values of the mean parameter \u00b5. The confidence level \u03b4 is set to 0.05, and the results are averaged across 100 independent runs. For this experiment and other experiments in this section, we set the parameters a = 0.6 and c = 1.1. We plot in Figure 3 the empirical accuracy, average number of samples used (runtime), and the number of samples after which 90% of the runs terminate.\n\nThe empirical accuracy of AH-RW is very high, as predicted by Theorem 4. Our bound is empirically very tight. If we decrease the bound by a factor of 2, that is we use f (n)/2 instead of f (n), we get the curve in the right hand side plot of Figure 3. Despite a speed up of approximately 4 times, the empirical accuracy gets below the 0.95 requirement, especially when \u00b5 is small.\n\nFigure 3: Empirical Performance of Boundary Tests. The plot on the left is the algorithm in Definition 4 and Theorem 4 with \u03b4 = 0.05; the plot on the right uses half the correct threshold. Despite a speed up of 4 times, the empirical accuracy drops below the requirement.\n\nWe also compare our method, AH-RW, to the Exponential Gap algorithm from [6] and the algorithm from the \u201csecond simulation lemma\u201d of [18]. Both of these algorithms rely on a batch sampling idea and have very similar performance. The results show that our algorithm is at least an order of magnitude faster (note the log-scale). 
We also evaluate a variant of our algorithm (ESAH-RW) where the boundary function f (n) is taken to be $f_{\\mathcal{N}_c}$ as in Theorem 2 and Equation (8). This algorithm achieves performance very similar to that of the algorithm in Theorem 4, justifying the practical applicability of batch sampling.\n\nFigure 4: Comparison of various algorithms for deciding the positiveness of the mean of a Bernoulli random variable. AH-RW and ESAH-RW use orders of magnitude fewer samples than alternatives.\n\n4.2 Best Arm Identification\n\nThe MAB (Multi-Arm Bandit) problem [1, 5] studies the optimal behavior of an agent when faced with a set of choices with unknown rewards. There are several flavors of the problem. In this paper, we focus on the fixed confidence best arm identification problem [13]. In this setting, the agent is presented with a set of arms A, where the arms are indistinguishable except for their expected reward. The agent is to make sequential decisions at each time step to either pull an arm \u03b1 \u2208 A, or to terminate and declare one arm to have the largest expected reward. The goal is to identify the best arm with a probability of error smaller than some pre-specified \u03b4 > 0.\n\nTo facilitate the discussion, we first define the notation we will use. We denote by K = |A| the total number of arms. We denote by $\\mu_\\alpha$ the true mean of an arm, and let $\\alpha^* = \\arg\\max_\\alpha \\mu_\\alpha$. We also define $\\hat{\\mu}_\\alpha(n_\\alpha)$ as the empirical mean after $n_\\alpha$ pulls of an arm.\n\nThis problem has been extensively studied, including recently [8, 14, 17, 15, 6]. 
A survey is presented by Jamieson and Nowak [13], who classify existing algorithms into three classes: action elimination based [8, 14, 17, 6], which achieve good asymptotics but often perform unsatisfactorily in practice; UCB based, such as lil\u2019UCB by [15]; and LUCB based approaches, such as [16, 13], which achieve sub-optimal asymptotics of O(K log K) but perform very well in practice. We provide a new algorithm that outperforms all previous algorithms, including LUCB, in Algorithm 1.\n\nTheorem 5. For any \u03b4 > 0, with probability 1 \u2212 \u03b4, Algorithm 1 outputs the optimal arm.\n\nAlgorithm 1 Adaptive Hoeffding Race (set of arms A, K = |A|, parameter \u03b4)\n\nfix parameters a = 0.6, c = 1.1, b = c/2 (log \u03b6(2a/c) + log(2/\u03b4))\ninitialize for all arms \u03b1 \u2208 A, $n_\\alpha = 0$; initialize $\\hat{A} = A$ to be the set of remaining arms\nwhile $\\hat{A}$ has more than one arm do\n    let $\\hat{\\alpha}^*$ be the arm with highest empirical mean, and compute for all $\\alpha \\in \\hat{A}$\n\n    $$f_\\alpha(n_\\alpha) = \\begin{cases} \\sqrt{\\left(a \\log(\\log_c n_\\alpha + 1) + b + c \\log|\\hat{A}|/2\\right)/n_\\alpha} & \\text{if } \\alpha = \\hat{\\alpha}^* \\\\ \\sqrt{\\left(a \\log(\\log_c n_\\alpha + 1) + b\\right)/n_\\alpha} & \\text{otherwise} \\end{cases}$$\n\n    draw a sample from the arm with largest value of $f_\\alpha(n_\\alpha)$ from $\\hat{A}$, $n_\\alpha = n_\\alpha + 1$\n    remove from $\\hat{A}$ arm \u03b1 if $\\hat{\\mu}_\\alpha + f_\\alpha(n_\\alpha) < \\hat{\\mu}_{\\hat{\\alpha}^*} - f_{\\hat{\\alpha}^*}(n_{\\hat{\\alpha}^*})$\nend while\nreturn the only element in $\\hat{A}$\n\n4.2.1 Experiments\n\nWe implemented Algorithm 1 and a variant where the boundary f is set to $f_{\\mathcal{N}_c}$ as in Theorem 2. We call this alternative version ES-AHR, standing for exponentially sparse adaptive Hoeffding race. For comparison we implemented the lil\u2019UCB and lil\u2019UCB+LS described in [14], and lil\u2019LUCB described in [13]. 
Based on the results of [13], these algorithms are the fastest known to date.\n\nWe also implemented the DISTRIBUTION-BASED-ELIMINATION from [6], which theoretically is the state-of-the-art in terms of asymptotic complexity. Despite this fact, the empirical performance is orders of magnitude worse compared to other algorithms for the instance sizes we experimented with.\n\nWe experimented with most of the distribution families considered in [13] and found qualitatively similar results. We only report results using the most challenging distribution we found that was presented in that survey, where $\\mu_i = 1 - (i/K)^{0.6}$. The distributions are Gaussian with 1/4 variance, and \u03b4 = 0.05. The sample count is measured in units of $H_1 = \\sum_{\\alpha \\ne \\alpha^*} \\Delta_\\alpha^{-2}$ hardness [13].\n\nFigure 5: Comparison of various methods for best arm identification. Our methods AHR and ES-AHR are significantly faster than state-of-the-art. Batch sampling ES-AHR is the most effective one.\n\n5 Conclusions\n\nWe studied the threshold crossing behavior of random walks, and provided new concentration inequalities that, unlike classic Hoeffding-style bounds, hold for any stopping rule. We showed that these inequalities can be applied to various problems, such as testing for positiveness of mean and best arm identification, obtaining algorithms that perform well both in theory and in practice.\n\nAcknowledgments\n\nThis research was supported by NSF (#1649208) and Future of Life Institute (#2016-158687).\n\nReferences\n\n[1] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. 2002.\n\n[2] A. Balsubramani. Sharp Finite-Time Iterated-Logarithm Martingale Concentration. ArXiv e-prints, May 2014. URL https://arxiv.org/abs/1405.2639.\n\n[3] A. Balsubramani and A. Ramdas. 
Sequential Nonparametric Testing with the Law of the Iterated Logarithm. ArXiv e-prints, June 2015. URL https://arxiv.org/abs/1506.03486.\n\n[4] Leo Breiman. Probability. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992. ISBN 0-89871-296-3.\n\n[5] Nicolo Cesa-Bianchi and G\u00e1bor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.\n\n[6] Lijie Chen and Jian Li. On the optimal sample complexity for best arm identification. CoRR, abs/1511.03774, 2015. URL http://arxiv.org/abs/1511.03774.\n\n[7] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Math., 3(1):79\u2013127, 2006. URL http://projecteuclid.org/euclid.im/1175266369.\n\n[8] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Jyrki Kivinen and Robert H. Sloan, editors, Computational Learning Theory, volume 2375 of Lecture Notes in Computer Science, pages 255\u2013270. Springer Berlin Heidelberg, 2002.\n\n[9] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problem. Journal of Machine Learning Research (JMLR), 2006.\n\n[10] R. H. Farrell. Asymptotic behavior of expected sample size in certain one sided tests. Ann. Math. Statist., 35(1):36\u201372, 03 1964.\n\n[11] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 1963.\n\n[12] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13\u201330, 1963.\n\n[13] Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting, 2014.\n\n[14] Kevin Jamieson, Matthew Malloy, R. Nowak, and S. Bubeck. 
On finding the largest mean among many. ArXiv e-prints, June 2013.\n\n[15] Kevin Jamieson, Matthew Malloy, Robert Nowak, and S\u00e9bastien Bubeck. lil\u2019UCB : An optimal exploration algorithm for multi-armed bandits. Journal of Machine Learning Research (JMLR), 2014.\n\n[16] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In ICML-2012, pages 655\u2013662, New York, NY, USA, June-July 2012.\n\n[17] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In ICML-2013, volume 28, pages 1238\u20131246. JMLR Workshop and Conference Proceedings, May 2013.\n\n[18] Richard M. Karp and Robert Kleinberg. Noisy binary search and its applications. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA \u201907, pages 881\u2013890, Philadelphia, PA, USA, 2007.\n\n[19] Volodymyr Mnih, Csaba Szepesv\u00e1ri, and Jean-Yves Audibert. Empirical bernstein stopping. In ICML-2008, pages 672\u2013679, New York, NY, USA, 2008.\n\n[20] Omar Rivasplata. Subgaussian random variables: An expository note, 2012.\n\n[21] Pranab K. Sen and Julio M. Singer. Large Sample Methods in Statistics: An Introduction with Applications. Chapman and Hall, 1993.", "award": [], "sourceid": 741, "authors": [{"given_name": "Shengjia", "family_name": "Zhao", "institution": "Tsinghua University"}, {"given_name": "Enze", "family_name": "Zhou", "institution": "Tsinghua University"}, {"given_name": "Ashish", "family_name": "Sabharwal", "institution": "Allen Institute for AI"}, {"given_name": "Stefano", "family_name": "Ermon", "institution": "Stanford"}]}