{"title": "Cost Effective Active Search", "book": "Advances in Neural Information Processing Systems", "page_first": 4880, "page_last": 4889, "abstract": "We study a special paradigm of active learning, called cost effective active search, where the goal is to find a given number of positive points from a large unlabeled pool with minimum labeling cost. Most existing methods solve this problem heuristically, and few theoretical results have been established. We adopt a principled Bayesian approach for the first time. We first derive the Bayesian optimal policy and establish a strong hardness result: the optimal policy is hard to approximate, with the best-possible approximation ratio lower bounded by $\\Omega(n^{0.16})$. We then propose an efficient and nonmyopic policy using the negative Poisson binomial distribution. We propose simple and fast approximations for computing its expectation, which plays an essential role in our proposed policy. We conduct comprehensive experiments on various domains such as drug and materials discovery, and demonstrate that our proposed search procedure is superior to the widely used greedy baseline.", "full_text": "Cost effective active search\n\nShali Jiang\nCSE, WUSTL\nSt. Louis, MO 63130\njiang.s@wustl.edu\n\nBenjamin Moseley\nTepper School of Business, CMU and Relational AI\nPittsburgh, PA 15213\nmoseleyb@andrew.cmu.edu\n\nRoman Garnett\nCSE, WUSTL\nSt. Louis, MO 63130\ngarnett@wustl.edu\n\nAbstract\n\nWe study a paradigm of active learning we call cost effective active search, where the goal is to find a given number of positive points from a large unlabeled pool with minimum labeling cost. Most existing methods solve this problem heuristically, and few theoretical results have been established. Here we adopt a principled Bayesian approach for the first time. 
We first derive the Bayesian optimal policy and establish a strong hardness result: the optimal policy is hard to approximate, with the best-possible approximation ratio bounded below by \u2126(n^0.16). We then propose an efficient and nonmyopic policy, simulating future search progress using the negative Poisson binomial distribution. We propose simple and fast approximations for its expectation, which plays an essential role in our proposed policy. We conduct comprehensive experiments on drug and materials discovery datasets and demonstrate that our proposed method is superior to a popular (greedy) baseline.\n\n1 Introduction\n\nActive search is a particular realization of active learning where the goal is to identify positive points from a large pool of unlabeled candidates. To make this problem compelling, we typically assume that the positive points are extremely rare (e.g., <1%) and that the labeling process is costly, making the search challenging. A prototypical application is drug discovery, where we seek to identify compounds exhibiting binding activity with a certain biological target from among millions of candidates [Warmuth et al., 2001, Garnett et al., 2015]. Other applications include materials discovery [Jiang et al., 2018b], product recommendation [Vanchinathan et al., 2015], etc.\n\nThere have been major theoretical and algorithmic advances on active search in recent years [Garnett et al., 2012, 2015, Jiang et al., 2017, 2018a]. This work has primarily focused on the budgeted setting, where the goal is to find as many positive points as possible within a given budget B, the total number of points one can query, known a priori. In this paper, we consider the \u201cdual\u201d problem: how to find a given number of positives with minimum cost. 
Formally, given a large pool of n points X = {x_i}_{i=1}^n with corresponding unknown labels y_i \u2208 {0, 1} indicating whether x_i is positive or not, we want to sequentially choose a set D \u2282 X of points to evaluate to identify a given target number T of positives in as few iterations as possible. We call this problem cost effective active search (CEAS). To contrast with the budgeted setting, we present both formulations:\n\nBudgeted: arg max_{D \u2286 X} \u2211_{x_i \u2208 D} y_i, s.t. |D| = B;    CEAS: arg min_{D \u2286 X} |D|, s.t. \u2211_{x_i \u2208 D} y_i \u2265 T.\n\nOne might wonder if we could reduce CEAS to the budgeted case, for which many effective policies are readily available [Jiang et al., 2017]. For example, given a CEAS problem with target T, we could instead solve a budgeted problem with an estimated budget B. However, this \u201cdual\u201d transformation is not one-to-one: given a budget B, the expected utility is no longer T. In fact, both Pr(utility | B) and Pr(cost | T) are highly complicated distributions, and estimating their expectations is intractable, as we will show later. Therefore it is necessary to develop specific policies for CEAS.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nCEAS is a direct model of practical cases where we must achieve a certain target with minimum cost; for example, during initial screening for drug discovery, we may seek a certain number of active compounds to serve as lead compounds to be refined later in the discovery process. There have been several previous investigations into the CEAS setting (e.g., Warmuth et al. [2001, 2003]), but the proposed policies are often based on heuristics, and to the best of our knowledge, few theoretical results have been established regarding the hardness of this problem.\n\nIn this paper, we study the CEAS problem under the Bayesian decision-theoretic framework. 
We first derive the Bayesian optimal policy and establish a strong hardness result showing that the optimal policy is extremely inapproximable. In particular, we show any efficient algorithm is at least \u2126(n^0.16) times worse than the optimal policy in terms of the expected cost needed.\n\nWe then propose nonmyopic approximations to the Bayesian optimal policy and develop efficient implementation and pruning techniques that make nonmyopic search in large spaces possible. To that end, we discuss an understudied distribution called the \u201cnegative Poisson binomial\u201d and propose a simple and fast approximation to its expectation, an essential component of our efficient nonmyopic policy. We conduct comprehensive experiments on benchmark datasets in various domains such as drug and materials discovery and demonstrate that our policy is superior to widely used baselines.\n\n2 Related work\n\nActive search in the budgeted setting has been well studied under the Bayesian decision-theoretic framework in recent years. Garnett et al. [2012] derived the Bayesian optimal policy, proved that an \u2113-step lookahead policy can be arbitrarily better than an m-step one for any \u2113 > m, and proposed the use of myopic approximations in practice, e.g., one- or two-step lookahead, for computational tractability. Garnett et al. [2015] comprehensively compared one- and two-step lookahead policies on 360 drug discovery datasets and showed that the two-step policy performs significantly better than greedy search most of the time.\n\nJiang et al. [2017] proposed an efficient nonmyopic approximation to the Bayesian optimal policy for the budgeted setting by assuming conditional independence, and generalized this approach to batch policies [Jiang et al., 2018a]. They also proved a hardness result showing that the approximation ratio for the budgeted case is bounded above by O(1/\u221alog n) (note it is a maximization setting). 
Here we apply a similar proof technique to the CEAS setting but prove a much stronger bound of order \u2126(n^0.16).\n\nActive search is a form of active learning with a special utility (or cost) function. Chen and Krause [2013] studied minimum-cost active learning under the version space reduction utility and proved that a greedy algorithm is near-optimal due to adaptive submodularity. Warmuth et al. [2001] studied \u201cactive learning in the drug discovery process.\u201d They used perceptron and SVM as predictive models and tried several variants of uncertainty sampling (sampling points closest to the separating hyperplane) and greedy sampling (choosing points furthest from the hyperplane). The authors found that greedy sampling performed better, but offered no theoretical justification, in contrast to our decision-theoretic approach.\n\nCEAS is a special case of the adaptive stochastic minimum cost cover problem [Golovin and Krause, 2011], for which an inapproximability bound was proved using an instance construction similar in spirit to our own, including the idea of \u201ctreasure hunting\u201d and using XOR for encoding. However, their worst case construction is much more complicated and does not directly correspond to active search. Furthermore, their bound only holds conditionally (PH \u2260 \u03a3_2^P). Our proof is much simpler and does not rely on any complexity-theoretic hypothesis.\n\nActive search is also closely related to the total recall problem in information retrieval [Grossman et al., 2016, Yu and Menzies, 2018, Renders, 2018] or the so-called technology-assisted review process for evidence discovery in legal settings [Grossman and Cormack, 2010], where the goal is to retrieve (nearly) all relevant documents at a reasonable labeling cost. 
Despite the similarity to our setting, retrieving all positives could be drastically different (and arguably much harder). One immediate difference is that we often do not know the total number of positives in a dataset a priori, thus a considerable amount of research is devoted to deciding when to stop in the total recall procedure [Grossman et al., 2016]. Our definition of CEAS avoids this complication.\n\n3 Bayesian optimal policy\n\nWe motivate our approach using Bayesian decision theory. We assume there is a model that provides the posterior probability of a point x being positive, conditioned on previously observed data D, i.e., Pr(y = 1 | x, D). The Bayesian optimal policy then chooses the point that minimizes the expected number of further iterations required. Formally, in iteration (i + 1), where we have observed D_i,\n\nx\u2217 = arg min_x E[c_r | x, D_i],    (1)\n\nwhere c_r denotes the further cost incurred (i.e., the size of the chosen set) until r more positives are identified after D_i, and r = T \u2212 \u2211_{(x,y)\u2208D_i} y is the remaining target yet to be achieved.\n\nThe expected cost can be written as a Bellman equation:\n\nE[c_r | x, D_i] = 1 + Pr(y = 1 | x, D_i) \u00b7 min_{x'} E[c_{r\u22121} | x', x, y = 1, D_i] + Pr(y = 0 | x, D_i) \u00b7 min_{x'} E[c_r | x', x, y = 0, D_i].    (2)\n\nHowever, this recursion is not mathematically well defined since there is no exit. To see this, consider the base case where r = 1, i.e., only one more positive left to find. With probability Pr(y = 1 | x, D_i), x is positive, and we finish with cost 1; with probability 1 \u2212 Pr(y = 1 | x, D_i), x is negative, and then we need to update the probability conditioned on y = 0 and repeat this process. That is,\n\nE[c_1 | x, D_i] = Pr(y = 1 | x, D_i) \u00b7 1 + (1 \u2212 Pr(y = 1 | x, D_i)) \u00b7 min_{x'} E[c_1 | x', x, y = 0, D_i].    (3)\n\nSo even for r = 1, the recursion never exits. 
Note this is very different from the budgeted setting [Garnett et al., 2012], where the base case with budget 1 is well defined and trivial to compute.\n\nIn practice, the recursion must stop after exhausting the whole pool. We assume the search stops after at most t steps; we stress that this is only for derivation purposes, and should not be confused with the budgeted setting with budget t. If we set t = |X|, then we keep searching until the target is achieved or there are no more points left to evaluate. Let c_t^r be the cost incurred after r positives are found or t evaluations are completed. Now we can derive the base case of the recursion. To simplify notation, we will omit the conditioning on D_i in the following and assume all probabilities are implicitly conditioned on current observations. Let p(x) = Pr(y = 1 | x, D_i). We start with the simplest case r = 1:\n\nWhen t = 1, the expected cost would be 1 no matter what.\n\nWhen t = 2, E[c_2^1 | x] = p(x) \u00b7 1 + (1 \u2212 p(x)) \u00b7 2 = 2 \u2212 p(x). In this case, the greedy policy choosing the point with the highest probability is optimal.\n\nWhen t = 3,\n\nE[c_3^1 | x] = p(x) \u00b7 1 + (1 \u2212 p(x)) \u00b7 min_{x'} E[c_2^1 | x', x, y = 0] = 2 \u2212 p(x) \u2212 (1 \u2212 p(x)) max_{x'} p'(x'),    (4)\n\nwhere p'(x') = Pr(y' = 1 | x', x, y = 0). We can see\n\narg min_x E[c_3^1 | x] \u21d4 arg max_x p(x) + (1 \u2212 p(x)) max_{x'} p'(x').    (5)\n\nNote the form on the right-hand side of (5) has a clear exploration vs. exploitation interpretation. It balances between choosing a point with high probability, as governed by p(x) (exploit), and a point that has low probability but would lead to a high-probability point if it turns out to be negative, as governed by (1 \u2212 p(x)) max_{x'} p'(x') (explore). 
This is counter-intuitive: one would imagine that with only one positive left to find, the best thing to do would be to act greedily, but we have shown that if we are granted three or more iterations, greedy might not be optimal in terms of minimizing expected cost. Therefore, we argue that nonmyopic planning is crucial for CEAS.\n\nWe draw a connection between (5) and the two-step score [Garnett et al., 2012] in the budgeted setting:\n\narg max_x p(x) + p(x) max_{x'} Pr(y' = 1 | x', y = 1) + (1 \u2212 p(x)) max_{x'} Pr(y' = 1 | x', y = 0);    (6)\n\nwe can see that in (6), the future utility is averaged over the positive and negative cases, whereas in (5) only the negative case is considered.\n\nThe recursion is now well defined for r = 1:\n\nE[c_t^1 | x] = p(x) \u00b7 1 + (1 \u2212 p(x)) \u00b7 min_{x'} E[c_{t\u22121}^1 | x', x, y = 0].    (7)\n\nNote it takes O(n^t) time to compute even for the simplest case r = 1. In general (r \u2265 1), we have\n\nE[c_t^r | x] = 1 + p(x) \u00b7 min_{x'} E[c_{t\u22121}^{r\u22121} | x', x, y = 1] + (1 \u2212 p(x)) \u00b7 min_{x'} E[c_{t\u22121}^r | x', x, y = 0].    (8)\n\n3.1 Hardness of cost effective active search\n\nThe above derivation indicates that we cannot efficiently compute the expected cost exactly. In fact, the optimal policy is not only hard to compute, it is even hard to approximate. As we show in the following theorem, any efficient algorithm is at least \u2126(n^0.16) times worse than the optimal policy in terms of average cost:\n\nTheorem 1. Any algorithm A with computational complexity o(n^{n^\u03b5}) has an approximation ratio \u2126(n^\u03b5), for \u03b5 = 0.16; that is,\n\nE[cost_A] / OPT = \u2126(n^\u03b5),    (9)\n\nwhere E[cost_A] is the expected cost of A, and OPT is that of the optimal policy.\n\nNote that this lower bound is very tight since O(n) is a trivial upper bound. 
We prove the theorem by constructing a class of active search instances similar to that of the hardness proof for the budgeted setting [Jiang et al., 2017]. Specifically, there is a small secret set of points, the labels of which encode the location of a clump of positive points. The optimal policy (with unlimited computational power) could easily identify this secret set by enumerating all possible subsets of the same size and feeding them into the inference model, thereby revealing the positive clump and completing the search quickly. However, for an algorithm with limited computational power, the probability of revealing this secret set is extremely low, and hence it cannot do any better than randomly guessing, which results in a much higher expected cost.\n\nOur proof has two key differences compared to previous work. First, it results in an exponentially stronger bound (n^0.16 vs. \u221alog n). Second, the improved bound requires a different methodology for hiding the set of profitable points from the algorithm. In particular, a new counting technique is used for bounding the probability of identifying the \u201csecret set\u201d. In the budgeted setting, one key argument considered \u201cIf an efficient policy selects a subset of B points, how likely is it to identify the secret set?\u201d, which was straightforward to compute since the chosen budget B defined the cardinality of the selected subset. Such reasoning does not apply to the CEAS setting. In fact, we leverage the underlying algorithmic challenge of optimizing the newly considered objective, where the algorithm has to continue searching until it reaches the target. A formal proof is given in the appendix.\n\n4 Approximation of the Bayesian optimal policy\n\nThe main cause of the high complexity of computing the expected cost is that we need to recursively update the probabilities of the remaining points conditioned on possible outcomes of the previous points. 
So a natural way to relieve the burden is to assume conditional independence (CI) after several steps, e.g., after observing the point we are considering. This idea has been adopted to approximate the expected utility in the budgeted setting, and was demonstrated to perform very well [Jiang et al., 2017, 2018a] in practice. We propose to adapt this idea to CEAS. While in the budgeted setting the expected future utility under the CI assumption is simply the sum of the top probabilities up to the remaining budget, it is much less straightforward to compute the expected cost in the CEAS setting.\n\nWe model the remaining cost as a negative Poisson binomial (NPB) distribution, i.e., the number of coins (with nonuniform biases) that need to be flipped (independently) to get a given number of HEADs. This is in contrast to the utility being modeled as a Poisson binomial (PB) distribution in the budgeted setting. The NPB distribution is a natural generalization of the negative binomial (NB) distribution, where all coins have the same bias. While both NB and PB distributions are well studied, few references can be found for NPB. Some informal discussions about NPB can be found on Physics Forums [1]. Charalambous [2014] defines a distribution they also call NPB, but it is actually a sum of geometric variables. Liebscher and Kirschstein [2017] define NPB as the number of \u201cfailures given successes\u201d for predicting the outcome of darts tournaments. In the following, we formally define the NPB distribution and propose novel and fast approximations to its expectation, based on which we derive our efficient nonmyopic policy.\n\n[1] https://www.physicsforums.com/threads/negative-poisson-binomial-distribution.759630/\n\n4.1 Negative Poisson binomial distribution\n\nWe define the NPB distribution in an intuitive manner as follows:\n\nDefinition 1 (Negative Poisson binomial distribution). 
Let there be an infinite number of ordered coins with HEADs probabilities p_1, p_2, p_3, ...; given the number r of HEADs required, we toss the coins one by one in this order until r HEADs occur. We say the number of coins tossed, m, follows a negative Poisson binomial distribution, or m \u223c NPB(r, [p_1, p_2, p_3, ...]).\n\nNote that this distribution is supported on all integers m \u2265 r. We adopt a truncated version where we have a finite number of coins, n, and we assume n is large enough that Pr(m \u2265 n) is negligible, which is typically the case in active search since r \u226a n, where n is the size of the unlabeled pool and r is the target to achieve.\n\nIts PMF can be derived using the PMF of a Poisson binomial (PB) distribution r \u223c PB([p_1, p_2, ..., p_n]), which is the number of HEADs if we independently toss n coins with HEADs probabilities p_1, p_2, ..., p_n. Denote by Pr_PB(i, j) the probability of j HEADs when there are i coins with probabilities p_1, p_2, ..., p_i; then Pr_PB(n, r) can be computed via dynamic programming (DP):\n\nPr_PB(n, r) = \u220f_{i=1}^n (1 \u2212 p_i), if r = 0;  Pr_PB(n, r) = p_n Pr_PB(n \u2212 1, r \u2212 1) + (1 \u2212 p_n) Pr_PB(n \u2212 1, r), if 0 < r \u2264 n.    (10)\n\nThe DP table of Pr_PB(n, r) is of size O(nr). Chen and Liu [1997] also derived other formulas for computing the PMF of the PB distribution.\n\nGiven the PMF of the PB distribution, the PMF of m \u223c NPB(r, [p_1, ..., p_n]) is, for all m \u2265 r,\n\nPr_NPB(m, r) = p_m Pr_PB(m \u2212 1, r \u2212 1),    (11)\n\nand the expectation is E[m] = \u2211_{i=r}^n Pr_NPB(i, r) \u00b7 i. Note that in computing Pr_NPB(n, r), all Pr_NPB(m, r) for m < n are also computed, so the complexity of computing the expectation is also O(nr).\n\n4.2 Approximation of the NPB expectation\n\nThe complexity O(nr) for computing the expectation is prohibitively high. 
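For reference, the exact O(nr) computation described in Section 4.1 can be sketched in a few lines of Python; this is a minimal sketch of the DP in (10)-(11), with function and variable names of our own choosing (the paper's released implementation is in Matlab):

```python
def npb_expectation_exact(p, r):
    """Exact E[m] for m ~ NPB(r, [p_1, ..., p_n]) via the PB dynamic program.

    Follows (10)-(11): pb[j] holds Pr_PB(i, j), the probability of j HEADs
    among the first i coins, for j = 0, ..., r - 1 (all that is ever needed,
    since (11) only reads Pr_PB(m - 1, r - 1)).
    """
    n = len(p)
    pb = [0.0] * r
    pb[0] = 1.0  # Pr_PB(0, 0) = 1: zero coins give zero HEADs
    expectation = 0.0
    for m in range(1, n + 1):
        pm = p[m - 1]
        if m >= r:
            # Pr_NPB(m, r) = p_m * Pr_PB(m - 1, r - 1), weighted by m
            expectation += m * pm * pb[r - 1]
        # fold coin m into the PB table; update higher counts first
        for j in range(min(m, r - 1), 0, -1):
            pb[j] = pm * pb[j - 1] + (1 - pm) * pb[j]
        pb[0] *= 1.0 - pm
    return expectation
```

In the uniform case this recovers the negative binomial mean r/p up to the (negligible) truncation at n.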
To reduce its complexity, we observe that in practice Pr_NPB(m, r) will be very close to zero for m \u226b r. So we can stop at \u00afm when \u2211_{i=r}^{\u00afm} Pr_NPB(i, r) \u2265 1 \u2212 \u03b5 for, e.g., \u03b5 = 10^\u22126. Hence, an almost exact solution can be computed in O(\u00afmr) time. We call this \u03b5-DP.\n\nThe O(\u00afmr) complexity for computing the expectation may still be high, since we might need to compute this expectation for every candidate point in each iteration. We propose another cheap but accurate approximate method with only O(E[m]) complexity. The idea is simple: coin toss i contributes p_i HEADs in expectation, so we accumulate this until r HEADs occur. That is,\n\nE[m] \u2248 min {k : \u2211_{i=1}^k p_i \u2265 r}.    (12)\n\nWe call this approximation ACCU (short for \u201caccumulate\u201d). Note that ACCU always returns an integer, while the true expectation might not be integral. To fix this, we subtract a correction term. Let \u02c6m = min {k : \u2211_{i=1}^k p_i \u2265 r}. We check what portion of p_{\u02c6m} was needed for the sum to be exactly r, and remove the extra portion. That is,\n\nE[m] \u2248 \u02c6m \u2212 (\u2211_{i=1}^{\u02c6m} p_i \u2212 r) / p_{\u02c6m}.    (13)\n\nWe call this approximation ACCU\u2019. In the special case of the negative binomial distribution (i.e., p = p_1 = p_2 = \u00b7\u00b7\u00b7 = p_n), ACCU\u2019 recovers the true expectation r/p.\n\nOne might be tempted to approximate the expectation by a natural generalization of the expectation of a negative binomial distribution, r/p = 1/p + 1/p + \u00b7\u00b7\u00b7 + 1/p; that is, E[m] \u2248 1/p_1 + 1/p_2 + \u00b7\u00b7\u00b7 + 1/p_r. We call this RECIP. Note that this approximation is also exact when p_1 = p_2 = \u00b7\u00b7\u00b7 = p_n = p. 
We will see that this approximation can be very poor.\n\nTable 1: Time cost and quality of various approximations of the expectation of the NPB distribution.\n\n         EXACT     \u03b5-DP     ACCU\u2019    ACCU     RECIP\nRMSE     -         0.0004   0.0438   0.5723   7.8276\ntime(s)  97.4841   0.1851   0.0029   0.0029   0.0002\n\nWe run simulations to demonstrate how close these approximations are. We parameterize the NPB distribution using the posterior marginal probabilities [p_1, ..., p_n] computed in a typical active search iteration using a k-nn model (see Figure 2 in the appendix). We plot the approximation errors against the EXACT expectation computed via the DP of Section 4.1 for r = 1, ..., 500, shown in Figure 1a. We can see \u03b5-DP has essentially zero error everywhere. ACCU\u2019 also has zero error almost everywhere, except at two locations. ACCU consistently overestimates the exact value by a small fraction, due to returning an integer, whereas RECIP considerably underestimates the true value, especially when r is large. The root mean square error (RMSE) and total time cost for computing the 500 expectations are shown in Table 1.\n\nAfter a closer look at the locations where the errors are higher (e.g., around r = 220, for which E[m] \u2248 300), we find that such locations exactly correspond to r values for which the probability around E[m] drops abruptly (refer to Figure 2 in the appendix). This makes perfect sense, since ACCU\u2019 and RECIP are not aware of such changes after E[m]. RECIP suffers from this problem more severely since it only looks at r probability values, whereas ACCU\u2019 looks at E[m] values. Overall, ACCU\u2019 offers an appropriate time-quality tradeoff, and we will use this approximation for our policy. 
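The closed-form approximations themselves are only a few lines each. A minimal Python sketch (our naming; it assumes, as in the paper, that the probabilities are supplied in descending order):

```python
def accu_prime(p, r):
    """ACCU' approximation (13) to the NPB expectation: accumulate expected
    HEADs until the target r is reached, then subtract the unused fraction
    of the last coin so the result need not be an integer."""
    total = 0.0
    for k, pk in enumerate(p, start=1):
        total += pk
        if total >= r:
            return k - (total - r) / pk
    raise ValueError("sum of probabilities is below the target r")

def recip(p, r):
    """RECIP baseline: sum of reciprocals of the first r probabilities."""
    return sum(1.0 / pk for pk in p[:r])
```

In the uniform case p_i = p both return exactly r/p; they differ once the probabilities decay, which is where RECIP's error grows.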
It is an interesting question whether we can derive error bounds for ACCU\u2019.\n\n4.3 Efficient nonmyopic approximations\n\nWe can approximate the expected cost (8) as follows:\n\nE[c_r | x, D_i] \u2248 1 + Pr(y = 1 | x, D_i) E[m+] + Pr(y = 0 | x, D_i) E[m\u2212] \u2261 f(x),    (14)\n\nwhere E[m+] \u2248 E[c_{r\u22121} | x, y = 1, D_i] and E[m\u2212] \u2248 E[c_r | x, y = 0, D_i], and Pr(c_{r\u22121} | x, y = 1, D_i) and Pr(c_r | x, y = 0, D_i) are assumed to be NPB distributions defined by the posterior marginal probabilities (in descending order) after the respective conditioning. We use ACCU\u2019 to compute the approximate expectations. In every iteration, we choose the x minimizing f(x). We call this policy efficient nonmyopic cost effective search, or ENCES.\n\nWe further propose to adapt our policy by treating the remaining target r as a tuning parameter. In particular, we argue that it is better to set this parameter smaller than the actual remaining target. The rationale is three-fold: (1) the approximation based on the NPB distribution typically overestimates the actual expected cost, since we are pretending the probabilities will not change after the current iteration, whereas in fact the top probabilities defining the NPB distribution will increase considerably as we discover more positives; so the actual expected cost should be much smaller. (2) Even if the CI assumption were correct, planning too far ahead might hurt if the model is not accurate enough; after all, \u201call models are wrong.\u201d (3) Setting r smaller makes the bounds on the expectation much tighter, hence the pruning is much more effective (details on pruning are included in the appendix).\n\nWe consider two schemes: setting r to a constant or proportional to the remaining target. 
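Putting the pieces together, the score (14) for a single candidate can be sketched as follows. This is a hypothetical sketch: the helper names and the model interface (`p_x`, `probs_if_pos`, `probs_if_neg`) are ours, not the paper's, and ACCU\u2019 is used for both NPB expectations as described above:

```python
def accu_prime(p_sorted, r):
    """ACCU'-style NPB expectation per (13); redefined here so the sketch is
    self-contained. Assumes p_sorted is in descending order."""
    total = 0.0
    for k, pk in enumerate(p_sorted, start=1):
        total += pk
        if total >= r:
            return k - (total - r) / pk
    return float(len(p_sorted))  # target not reachable within the pool

def ences_score(p_x, probs_if_pos, probs_if_neg, r):
    """Approximate expected further cost f(x) of querying x, as in (14).

    p_x: current Pr(y = 1 | x, D) from the model.
    probs_if_pos / probs_if_neg: the model's posterior marginal probabilities
    of the remaining unlabeled points, conditioned on x being positive /
    negative (any probabilistic model providing these would do).
    """
    # E[m+]: expected further cost to find the remaining r - 1 positives if y = 1
    m_pos = accu_prime(sorted(probs_if_pos, reverse=True), r - 1) if r > 1 else 0.0
    # E[m-]: expected further cost to find all r positives if y = 0
    m_neg = accu_prime(sorted(probs_if_neg, reverse=True), r)
    return 1.0 + p_x * m_pos + (1.0 - p_x) * m_neg
```

In each iteration ENCES would evaluate this score for every candidate (with the pruning mentioned above) and query the minimizer.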
For example, ENCES-10 means we always set r = 10 if the remaining target is greater than 10, and otherwise we use the actual remaining target; ENCES-0.2 means we set r to 20% of the actual remaining target.\n\n5 Experiments\n\nWe conduct experiments to compare our proposed policy against three baselines [2]:\n\nGREEDY: this is the most widely used policy, which always chooses a point with the highest probability. It is equivalent to choosing the point furthest from the hyperplane in the case of an SVM model [Warmuth et al., 2001]. It has been shown that this baseline is hard to beat in the total recall setting [Grossman et al., 2016].\n\n[2] Matlab implementations of our method and the baselines are available here: https://github.com/shalijiang/efficient_nonmyopic_active_search.git\n\nFigure 1: (a) Approximation errors of various approximation methods for computing the expectation of the NPB distribution. The y-axis is the approximate E[m] minus the exact E[m]. The error for RECIP drops below -10 after about r = 250, hence it is not shown, to zoom in on the interesting part of the figure. (b) Average cost versus the number of positives found, averaged over 9 drug discovery datasets.\n\nTable 2: Results on materials discovery data. Averaged over 30 experiments.\n\nT        GREEDY  TWO-STEP  ENS-70  ENS-0.7  ENCES-50  ENCES-0.5\n50       84.5    86.0      81.5    80.2     85.2      87.1\n100      175.0   179.1     165.1   167.8    156.4     164.3\n200      347.7   349.0     337.7   328.4    330.8     336.9\n300      522.5   533.2     524.6   509.7    518.4     513.7\n400      721.8   735.0     720.0   708.6    701.4     683.9\n500      924.1   938.1     910.0   887.4    885.7     891.9\n1000     2025.9  1973.1    1790.6  1798.5   1745.4    1774.1\n1500     2981.7  3019.4    2757.0  2773.6   2753.2    2724.0\naverage  972.9   976.6     910.8   906.8    897.1     897.0\n\nTWO-STEP: the budgeted two-step lookahead policy as defined in (6). 
Note we use (6) instead of (5), since (5) looks ahead by considering the maximum probability a point would lead to if it is negative, and it does not make much sense to use such a policy with k-nn, which only models positive correlation.\n\nENS: the efficient nonmyopic search policy recently developed by Jiang et al. [2017] for the budgeted setting, informally defined as follows: arg max_x p(x) + E_y[sum of top b probabilities | x, y], where b is the remaining budget after choosing x. This policy was shown to perform remarkably well in various domains due to its adaptation to the remaining budget. However, to apply it in the CEAS setting, we have to make a simple modification since there is no concept of a budget. We also experiment with two schemes of setting b: constant or proportional to the remaining target. We test ENS-10, 30, 50, 70 and ENS-0.1, 0.3, 0.5, 0.7, and compare our method against the best one. Such b\u2019s underestimate the true budget, but the same rationale given in Sec. 4.3 applies.\n\nWe run our policies with r = 10, 20, 30, 40, 50 and 0.1, 0.2, 0.3, 0.4, 0.5, and also report the results of the best one (on average) in each scheme. Full results are in the appendix. We report results on drug and materials discovery data [Jiang et al., 2017]. In all our experiments, we start the search with a single randomly selected positive point, and repeat the experiments 30 times.\n\nMaterials discovery. We apply active search to discover novel alloys forming bulk metallic glasses (BMGs). The database [Jiang et al., 2017] is composed of 118 678 known alloys, with 4 746 (about 4%) having the desired property. We test T = 50, 100, 200, 300, 400, 500, 1000, 1500. Table 2 shows the average cost. Following Jiang et al. 
[2018a], we highlight the entries with the lowest cost in boldface, and those not significantly worse than the best in blue italic, under one-sided paired t-tests with \u03b1 = 0.05. We summarize the results as follows: (1) all of the nonmyopic policies outperform GREEDY, and TWO-STEP performs on par with GREEDY. (2) ENCES variants are mostly the best or not significantly worse than the best. (3) ENS is a strong baseline, being the best or not significantly worse than ENCES for several cases. (4) ENCES with r being half of the remaining target performs the best on average, but r = 50 is not significantly worse.\n\nTable 3: Averaged results on nine drug discovery datasets, 30 experiments for each.\n\nT        GREEDY  TWO-STEP  ENS-30  ENS-0.7  ENCES-20  ENCES-0.2\n50       215.7   71.7      58.8    59.1     56.3      72.9\n100      414.4   156.0     134.9   132.8    112.7     116.0\n150      503.2   243.2     208.3   212.0    184.5     194.8\n200      587.4   322.4     283.3   284.2    255.1     298.9\naverage  430.2   198.4     171.3   172.0    152.2     170.7\n\nDrug discovery. Now we consider the main application of active search: drug discovery. We simulate virtual drug screening to find chemical compounds exhibiting binding activities with a certain biological target. We use the first nine drug discovery datasets as described in Sec. 5.1 of Jiang et al. [2018a]. Each dataset corresponds to a different biological target. The numbers of positives in the nine datasets are 553, 378, 506, 1023, 218, 916, 1024, 431, and 255, with a shared pool of 100 000 negatives. We set T = 50, 100, 150, 200. The average costs are shown in Table 3. Each entry in this table is averaged over the nine datasets and 30 experiments each, so in total 270 experiments for each policy and target T.\n\nWe see a consistent winner: ENCES-20. 
On average it outperforms all baselines by a large margin. In particular, it improves over GREEDY by a 56–73% reduction in average cost. ENCES-0.2 also performs very well, and is not significantly worse than ENCES-20 when T = 50. Though we only presented results for our method with the best parameters, we note that it outperforms GREEDY by a large margin for all other parameters as well. Full results are in the appendix. Also note that the tested ENS variants, which solve the CEAS problem by reducing it to the budgeted setting, are much better than the myopic methods but still significantly worse than our proposed method, which suggests it is beneficial to design policies specifically for the cost-effective setting.

To see how these policies behave during the course of search, we plot in Figure 1b the average cost for each policy along the way until T = 200 positives are found. The individual curves for each of the nine datasets are shown in the appendix. We only show ENCES-20 and ENS-30 in this plot; the curves of ENCES-0.2 and ENS-0.7 are very similar. The average cost of GREEDY exhibits a “piecewise linear” shape. This is likely due to its greedy behavior [Jiang et al., 2017]: it exploits the high-probability points around a discovered positive neighborhood until they are exhausted, and then must spend a long time searching blindly for another neighborhood, since it has learned little about the rest of the space; this produces the discontinuities in the cost curve after certain numbers of positives are collected. In contrast, TWO-STEP and ENS behave much more smoothly, with minimal discontinuity, and ENCES has almost perfectly linear cost with respect to T, with slope only slightly greater than one. This is a very desirable property for cost effective active search.

6 Conclusion

In this paper, we introduced and studied cost effective active search under the Bayesian decision-theoretic framework.
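The negative Poisson binomial distribution underlying our policy models the number of queries needed to observe t positives when point i is positive independently with probability probs[i]. As a rough illustration of the expectation our approximations target, here is a Monte Carlo sketch (our own hypothetical code, not part of the paper):

```python
import numpy as np

def npb_expected_cost(probs, t, n_samples=20000, seed=0):
    """Monte Carlo estimate of the negative Poisson binomial mean:
    the expected number of queries, taken in the given order, until
    t positives are observed (probs[i] = P(point i is positive))."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    # simulate independent Bernoulli outcomes for the whole pool
    outcomes = rng.random((n_samples, probs.size)) < probs
    successes = np.cumsum(outcomes, axis=1)
    reached = successes >= t
    # cost = 1-based index of the t-th success; if t positives never
    # appear in a sample, truncate the cost at the pool size
    cost = np.where(reached.any(axis=1), reached.argmax(axis=1) + 1, probs.size)
    return cost.mean()
```

The approximations proposed in the paper avoid this sampling cost entirely; the simulation merely pins down the quantity being approximated.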
This is the first principled study of the problem in the literature. We derived the Bayesian optimal policy and proved a novel hardness result: the optimal policy is extremely inapproximable, with approximation ratio bounded below by Ω(n^0.16). We then proposed an efficient strategy to approximate the optimal policy using the negative Poisson binomial distribution, and proposed efficient approximations for its expectation. We demonstrated the superior performance of our proposed policy against several baselines, including the widely used greedy policy and the state-of-the-art nonmyopic policy adapted from budgeted active search. The performance on drug discovery datasets was especially encouraging, with a 56–73% cost reduction on average compared to greedy sampling.

Regarding the tuning parameter in our policy, one rule of thumb is to set it relatively small (e.g., ≤ 50, or ≤ 50% of the remaining target). How to adapt it in a more principled way is an interesting future direction. Another future direction is to extend the proposed method to the batch setting, where multiple points are evaluated simultaneously.

Acknowledgments

We would like to thank Mark Bober for providing support regarding computational services. SJ and RG were supported by the National Science Foundation (NSF) under award numbers IIA–1355406, IIS–1845434, and OAC–1940224. BM was supported by a Google Research Award and by the NSF under awards CCF–1830711, CCF–1824303, and CCF–1733873.

References

C. Charalambous. On the evolution of particle fragmentation with applications to planetary surfaces, 2014. Chapter 5.

S. X. Chen and J. S. Liu. Statistical applications of the Poisson-binomial and conditional Bernoulli distributions. Statistica Sinica, pages 875–892, 1997.

Y. Chen and A. Krause. Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization. In S. Dasgupta and D.
McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML 2013), volume 28 of Proceedings of Machine Learning Research, pages 160–168, 2013.

R. Garnett, Y. Krishnamurthy, X. Xiong, J. G. Schneider, and R. P. Mann. Bayesian Optimal Active Search and Surveying. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 2012.

R. Garnett, T. Gärtner, M. Vogt, and J. Bajorath. Introducing the ‘active search’ method for iterative virtual screening. Journal of Computer-Aided Molecular Design, 29(4):305–314, 2015.

D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.

M. R. Grossman and G. V. Cormack. Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review. Rich. JL & Tech., 17:1, 2010.

M. R. Grossman, G. V. Cormack, and A. Roegiest. TREC 2016 Total Recall Track Overview. In TREC, 2016.

S. Jiang, G. Malkomes, G. Converse, A. Shofner, B. Moseley, and R. Garnett. Efficient Nonmyopic Active Search. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pages 1714–1723, 2017.

S. Jiang, G. Malkomes, M. Abbott, B. Moseley, and R. Garnett. Efficient nonmyopic batch active search. In Advances in Neural Information Processing Systems, pages 1107–1117, 2018a.

S. Jiang, G. Malkomes, B. Moseley, and R. Garnett. Efficient nonmyopic active search with applications in drug and materials discovery. In Machine Learning for Molecules and Materials Workshop at Neural Information Processing Systems, 2018b.

S. Liebscher and T. Kirschstein. Predicting the outcome of professional darts tournaments. International Journal of Performance Analysis in Sport, 17(5):666–683, 2017.

J.-M. Renders. Active search for high recall: A non-stationary extension of Thompson sampling. In G. Pasi, B. Piwowarski, L. Azzopardi, and A. Hanbury, editors, Advances in Information Retrieval, pages 722–728, 2018.

H. P. Vanchinathan, A. Marfurt, C.-A. Robelin, D. Kossmann, and A. Krause. Discovering Valuable Items from Massive Data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2015), pages 1195–1204, 2015.

M. K. Warmuth, G. Rätsch, M. Mathieson, J. Liao, and C. Lemmen. Active Learning in the Drug Discovery Process. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1449–1456, 2001.

M. K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, and C. Lemmen. Active Learning with Support Vector Machines in the Drug Discovery Process. Journal of Chemical Information and Computer Sciences, 43(2):667–673, 2003.

Z. Yu and T. Menzies. Total recall, language processing, and software engineering. In Proceedings of the 4th ACM SIGSOFT International Workshop on NLP for Software Engineering, pages 10–13, 2018.