{"title": "Bayesian Batch Active Learning as Sparse Subset Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 6359, "page_last": 6370, "abstract": "Leveraging the wealth of unlabeled data produced in recent years provides great potential for improving supervised models. When the cost of acquiring labels is high, probabilistic active learning methods can be used to greedily select the most informative data points to be labeled. However, for many large-scale problems standard greedy procedures become computationally infeasible and suffer from negligible model change. In this paper, we introduce a novel Bayesian batch active learning approach that mitigates these issues. Our approach is motivated by approximating the complete data posterior of the model parameters. While naive batch construction methods result in correlated queries, our algorithm produces diverse batches that enable efficient active learning at scale. We derive interpretable closed-form solutions akin to existing active learning procedures for linear models, and generalize to arbitrary models using random projections. We demonstrate the benefits of our approach on several large-scale regression and classification tasks.", "full_text": "Bayesian Batch Active Learning as\n\nSparse Subset Approximation\n\nRobert Pinsler\n\nDepartment of Engineering\nUniversity of Cambridge\n\nrp586@cam.ac.uk\n\nJonathan Gordon\n\nDepartment of Engineering\nUniversity of Cambridge\n\njg801@cam.ac.uk\n\nEric Nalisnick\n\nDepartment of Engineering\nUniversity of Cambridge\n\netn22@cam.ac.uk\n\nJos\u00e9 Miguel Hern\u00e1ndez-Lobato\n\nDepartment of Engineering\nUniversity of Cambridge\n\njmh233@cam.ac.uk\n\nAbstract\n\nLeveraging the wealth of unlabeled data produced in recent years provides great\npotential for improving supervised models. 
When the cost of acquiring labels is high, probabilistic active learning methods can be used to greedily select the most informative data points to be labeled. However, for many large-scale problems standard greedy procedures become computationally infeasible and suffer from negligible model change. In this paper, we introduce a novel Bayesian batch active learning approach that mitigates these issues. Our approach is motivated by approximating the complete data posterior of the model parameters. While naive batch construction methods result in correlated queries, our algorithm produces diverse batches that enable efficient active learning at scale. We derive interpretable closed-form solutions akin to existing active learning procedures for linear models, and generalize to arbitrary models using random projections. We demonstrate the benefits of our approach on several large-scale regression and classification tasks.

1 Introduction

Much of machine learning's success stems from leveraging the wealth of data produced in recent years. However, in many cases expert knowledge is needed to provide labels, and access to these experts is limited by time and cost constraints. For example, cameras could easily provide images of the many fish that inhabit a coral reef, but an ichthyologist would be needed to properly label each fish with the relevant biological information. In such settings, active learning (AL) [1] enables data-efficient model training by intelligently selecting points for which labels should be requested.

Taking a Bayesian perspective, a natural approach to AL is to choose the set of points that maximally reduces the uncertainty in the posterior over model parameters [2]. Unfortunately, solving this combinatorial optimization problem is NP-hard. Most AL methods iteratively solve a greedy approximation, e.g. using maximum entropy [3] or maximum information gain [2, 4]. 
These approaches alternate between querying a single data point and updating the model, until the query budget is exhausted. However, as we discuss below, sequential greedy methods have severe limitations in modern machine learning applications, where datasets are massive and models often have millions of parameters.

A possible remedy is to select an entire batch of points at every AL iteration. Batch AL approaches dramatically reduce the computational burden caused by repeated model updates, while resulting in much more significant learning updates. It is also more practical in applications where the cost of acquiring labels is high but can be parallelized. Examples include crowd-sourcing a complex labeling task, leveraging parallel simulations on a compute cluster, or performing experiments that require resources with time-limited availability (e.g. a wet-lab in the natural sciences). Unfortunately, naively constructing a batch using traditional acquisition functions still leads to highly correlated queries [5], i.e. a large part of the budget is spent on repeatedly choosing nearby points. Despite recent interest in batch methods [5-8], there currently exists no principled, scalable Bayesian batch AL algorithm.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Batch construction of different AL methods on cifar10, shown as a t-SNE projection [12]. Given 5000 labeled points (colored by class), a batch of 200 points (black crosses) is queried. Panels: (a) MAXENT, (b) BALD, (c) Ours.

In this paper, we propose a novel Bayesian batch AL approach that mitigates these issues. The key idea is to re-cast batch construction as optimizing a sparse subset approximation to the log posterior induced by the full dataset. This formulation of AL is inspired by recent work on Bayesian coresets [9, 10]. 
We leverage these similarities and use the Frank-Wolfe algorithm [11] to enable efficient Bayesian AL at scale. We derive interpretable closed-form solutions for linear and probit regression models, revealing close connections to existing AL methods in these cases. By using random projections, we further generalize our algorithm to work with any model with a tractable likelihood. We demonstrate the benefits of our approach on several large-scale regression and classification tasks.

2 Background

We consider discriminative models p(y|x, \theta) parameterized by \theta \in \Theta, mapping from inputs x \in \mathcal{X} to a distribution over outputs y \in \mathcal{Y}. Given a labeled dataset D_0 = \{x_n, y_n\}_{n=1}^N, the learning task consists of performing inference over the parameters \theta to obtain the posterior distribution p(\theta|D_0).

In the AL setting [1], the learner is allowed to choose the data points from which it learns. In addition to the initial dataset D_0, we assume access to (i) an unlabeled pool set X_p = \{x_m\}_{m=1}^M, and (ii) an oracle labeling mechanism which can provide labels Y_p = \{y_m\}_{m=1}^M for the corresponding inputs.

Probabilistic AL approaches choose points by considering the posterior distribution of the model parameters. Without any budget constraints, we could query the oracle M times, yielding the complete data posterior through Bayes' rule,

    p(\theta | D_0 \cup (X_p, Y_p)) = p(\theta|D_0) \, p(Y_p|X_p, \theta) / p(Y_p|X_p, D_0),    (1)

where p(\theta|D_0) plays the role of the prior. While the complete data posterior is optimal from a Bayesian perspective, in practice we can only select a subset, or batch, of points D' = (X', Y') \subseteq D_p due to budget constraints. From an information-theoretic perspective [2], we want to query points X' \subseteq X_p that are maximally informative, i.e. minimize the expected posterior entropy,

    X^* = \arg\min_{X' \subseteq X_p, |X'| \le b} \; E_{Y' \sim p(Y'|X', D_0)} \left[ H[\theta | D_0 \cup (X', Y')] \right],    (2)

where b is a query budget. Solving Eq. (2) directly is intractable, as it requires considering all possible subsets of the pool set. As such, most AL strategies follow a myopic approach that iteratively chooses a single point until the budget is exhausted. Simple heuristics, e.g. maximizing the predictive entropy (MAXENT), are often employed [13, 5]. Houlsby et al. [4] propose BALD, a greedy approximation to Eq. (2) which seeks the point x that maximizes the decrease in expected entropy:

    x^* = \arg\max_{x \in X_p} \; H[\theta|D_0] - E_{y \sim p(y|x, D_0)} \left[ H[\theta | x, y, D_0] \right].    (3)

While sequential greedy strategies can be near-optimal in certain cases [14, 15], they become severely limited for large-scale settings. In particular, it is computationally infeasible to re-train the model after every acquired data point, e.g. re-training a ResNet [16] thousands of times is clearly impractical. Even if such an approach were feasible, the addition of a single point to the training set is likely to have a negligible effect on the parameter posterior distribution [5]. Since the model changes only marginally after each update, subsequent queries thus result in acquiring similar points in data space. As a consequence, there has been renewed interest in finding tractable batch AL formulations. Perhaps the simplest approach is to naively select the b highest-scoring points according to a standard acquisition function. However, such naive batch construction methods still result in highly correlated queries [5]. This issue is highlighted in Fig. 1, where both MAXENT (Fig. 1a) and BALD (Fig.
1b) expend a large part of the budget on repeatedly choosing nearby points.

3 Bayesian batch active learning as sparse subset approximation

We propose a novel probabilistic batch AL algorithm that mitigates the issues mentioned above. Our method generates batches that cover the entire data manifold (Fig. 1c) and, as we will show later, are highly effective for performing posterior inference over the model parameters. Note that while our approach alternates between acquiring data points and updating the model for several iterations in practice, we restrict the derivations hereafter to a single iteration for simplicity.

The key idea behind our batch AL approach is to choose a batch D', such that the updated log posterior log p(\theta|D_0 \cup D') best approximates the complete data log posterior log p(\theta|D_0 \cup D_p). In AL, we do not have access to the labels before querying the pool set. We therefore take the expectation w.r.t. the current predictive posterior distribution p(Y_p|X_p, D_0) = \int p(Y_p|X_p, \theta) \, p(\theta|D_0) \, d\theta. The expected complete data log posterior is thus

    E_{Y_p}[\log p(\theta | D_0 \cup (X_p, Y_p))]
      = E_{Y_p}[\log p(\theta|D_0) + \log p(Y_p|X_p, \theta) - \log p(Y_p|X_p, D_0)]
      = \log p(\theta|D_0) + E_{Y_p}[\log p(Y_p|X_p, \theta)] + H[Y_p|X_p, D_0]
      = \log p(\theta|D_0) + \sum_{m=1}^{M} \big( E_{y_m}[\log p(y_m|x_m, \theta)] + H[y_m|x_m, D_0] \big),    (4)

where the m-th summand defines L_m(\theta), the first equality uses Bayes' rule (cf. Eq. (1)), and the third equality assumes conditional independence of the outputs given the inputs. This assumption holds for the type of factorized predictive posteriors we consider, e.g. as induced by Gaussian or Multinomial likelihood models.

Batch construction as sparse approximation. Taking inspiration from Bayesian coresets [9, 10], we re-cast Bayesian batch construction as a sparse approximation to the expected complete data log posterior. Since the first term in Eq. (4) only depends on D_0, it suffices to choose the batch that best approximates \sum_m L_m(\theta). Similar to Campbell and Broderick [10], we view L_m : \Theta \to R and L = \sum_m L_m as vectors in function space. Letting w \in \{0, 1\}^M be a weight vector indicating which points to include in the AL batch, and denoting L(w) = \sum_m w_m L_m (with slight abuse of notation), we convert the problem of constructing a batch to a sparse subset approximation problem, i.e.

    w^* = \arg\min_w \|L - L(w)\|^2   subject to   w_m \in \{0, 1\} \; \forall m,   \sum_m w_m \le b.    (5)

Intuitively, Eq. (5) captures the key objective of our framework: a "good" approximation to L implies that the resulting posterior will be close to the (expected) posterior had we observed the complete pool set. Since solving Eq. (5) is generally intractable, in what follows we propose a generic algorithm to efficiently find an approximate solution.

Inner products and Hilbert spaces. We propose to construct our batches by solving Eq. (5) in a Hilbert space induced by an inner product \langle L_n, L_m \rangle between function vectors, with associated norm \|\cdot\|. Below, we discuss the choice of specific inner products. Importantly, this choice introduces a notion of directionality into the optimization procedure, enabling our approach to adaptively construct query batches while implicitly accounting for similarity between selected points.

Frank-Wolfe optimization. To approximately solve the optimization problem in Eq. (5) we follow the work of Campbell and Broderick [10], i.e. we relax the binary weight constraint to be non-negative and replace the cardinality constraint with a polytope constraint. Let \sigma_m = \|L_m\|, \sigma = \sum_m \sigma_m, and K \in R^{M \times M} be a kernel matrix with K_{mn} = \langle L_m, L_n \rangle. The relaxed optimization problem is

    \min_w \; (1 - w)^T K (1 - w)   subject to   w_m \ge 0 \; \forall m,   \sum_m w_m \sigma_m = \sigma,    (6)

where we used \|L - L(w)\|^2 = (1 - w)^T K (1 - w). The polytope has vertices \{(\sigma/\sigma_m) 1_m\}_{m=1}^M and contains the point w = [1, 1, \ldots, 1]^T. Eq. (6) can be solved efficiently using the Frank-Wolfe algorithm [11], yielding the optimal weights w^* after b iterations. The complete AL procedure, Active Bayesian CoreSets with Frank-Wolfe optimization (ACS-FW), is outlined in Appendix A (see Algorithm A.1). The key computation in Algorithm A.1 (Line 6) is

    \langle L - L(w), L_n / \sigma_n \rangle = (1/\sigma_n) \sum_{m=1}^{M} (1 - w_m) \langle L_m, L_n \rangle,    (7)

which only depends on the inner products \langle L_m, L_n \rangle and norms \sigma_n = \|L_n\|. At each iteration, the algorithm greedily selects the vector L_f most aligned with the residual error L - L(w). The weights w are then updated according to a line search along the f-th vertex of the polytope (recall that the linearized subproblem over a polytope, as in Eq. (6), attains its optimum at a vertex), which by construction is a scaled f-th-coordinate unit vector. This corresponds to adding at most one data point to the batch in every iteration. Since the algorithm is allowed to re-select indices from previous iterations, the resulting weight vector has \le b non-zero entries. 
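The construction just described can be sketched in code as follows. This is a minimal illustration, assuming a precomputed kernel matrix K of the inner products \langle L_m, L_n \rangle; it is not the authors' reference implementation (the function name and the numerical tolerance are our own):

```python
import numpy as np

def acs_fw(K, b):
    """Sketch of the ACS-FW batch construction (cf. Eqs. (6)-(7), Algorithm A.1).

    K : (M, M) kernel matrix with K[m, n] = <L_m, L_n> under the chosen inner product.
    b : query budget, i.e. number of Frank-Wolfe iterations.
    Returns the indices of the selected pool points (at most b of them).
    """
    M = K.shape[0]
    sigmas = np.sqrt(np.diag(K))           # sigma_n = ||L_n||
    sigma = sigmas.sum()
    w = np.zeros(M)
    for _ in range(b):
        # Greedy step (Eq. (7)): pick the vector most aligned with the residual L - L(w).
        scores = (K @ (1.0 - w)) / sigmas
        f = int(np.argmax(scores))
        vertex = np.zeros(M)
        vertex[f] = sigma / sigmas[f]      # f-th vertex of the polytope
        # Exact line search towards the chosen vertex.
        d = vertex - w
        denom = float(d @ K @ d)
        gamma = 0.0 if denom == 0.0 else float(d @ K @ (1.0 - w)) / denom
        w = w + np.clip(gamma, 0.0, 1.0) * d
    # Project the weights back to {0, 1}; the tolerance guards against round-off.
    return np.flatnonzero(w > 1e-10)
```

Re-selecting an index in a later iteration only moves weight along an already-chosen vertex, which is why the final batch can contain fewer than b points.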
Empirically, we find that this property leads to smaller batches as more data points are acquired.

Since it is non-trivial to leverage the continuous weights returned by the Frank-Wolfe algorithm in a principled way, the final step of our algorithm is to project the weights back to the feasible space, i.e. set \tilde{w}^*_m = 1 if w^*_m > 0, and 0 otherwise. While this projection step increases the approximation error, we show in Section 7 that our method is still effective in practice. We leave the exploration of alternative optimization procedures that do not require this projection step to future work.

Choice of inner products. We employ weighted inner products of the form \langle L_n, L_m \rangle_{\hat\pi}, where we choose \hat\pi to be the current posterior p(\theta|D_0). We consider two specific inner products with desirable analytical and computational properties; however, other choices are possible. First, we define the weighted Fisher inner product [17, 10]

    \langle L_n, L_m \rangle_{\hat\pi, F} = E_{\hat\pi} \left[ \nabla_\theta L_n(\theta)^T \nabla_\theta L_m(\theta) \right],    (8)

which is reminiscent of information-theoretic quantities but requires taking gradients of the expected log-likelihood terms1 w.r.t. the parameters. In Section 4, we show that for specific models this choice leads to simple, interpretable expressions that are closely related to existing AL procedures.

An alternative choice that lifts the restriction of having to compute gradients is the weighted Euclidean inner product, which considers the marginal likelihood of data points [10],

    \langle L_n, L_m \rangle_{\hat\pi, 2} = E_{\hat\pi} \left[ L_n(\theta) L_m(\theta) \right].    (9)

The key advantage of this inner product is that it only requires tractable likelihood computations. 
In Section 5 this will prove highly useful in providing a black-box method for these computations in any model (that has a tractable likelihood) using random feature projections.

Method overview. In summary, we (i) consider the L_m in Eq. (4) as vectors in function space and re-cast batch construction as a sparse approximation to the full data log posterior from Eq. (5); (ii) replace the cardinality constraint with a polytope constraint in a Hilbert space, and relax the binary weight constraint to non-negativity; (iii) solve the resulting optimization problem in Eq. (6) using Algorithm A.1; (iv) construct the AL batch by including all points x_m \in X_p with w^*_m > 0.

1 Note that the entropy term in L_m (see Eq. (4)) vanishes under this norm as the gradient for \theta is zero.

4 Analytic expressions for linear models

In this section, we use the weighted Fisher inner product from Eq. (8) to derive closed-form expressions of the key quantities of our algorithm for two types of models: Bayesian linear regression and probit regression. Although the considered models are relatively simple, they can be used flexibly to construct more powerful models that still admit closed-form solutions. For example, in Section 7 we demonstrate how using neural linear models [18, 19] allows us to perform efficient AL on several regression tasks. We consider arbitrary models and inference procedures in Section 5.

Linear regression. Consider the following model for scalar Bayesian linear regression,

    y_n = \theta^T x_n + \epsilon_n,   \epsilon_n \sim N(0, \sigma_0^2),   \theta \sim p(\theta),    (10)

where p(\theta) is a factorized Gaussian prior with unit variance; extensions to richer Gaussian priors are straightforward. Given a labeled dataset D_0, the posterior is given in closed form as p(\theta | D_0, \sigma_0^2) = N(\theta; (X^T X + \sigma_0^2 I)^{-1} X^T y, \Sigma_\theta) with \Sigma_\theta = \sigma_0^2 (X^T X + \sigma_0^2 I)^{-1}. For this model, a closed-form expression for the inner product in Eq. (8) is

    \langle L_n, L_m \rangle_{\hat\pi, F} = (x_n^T x_m / \sigma_0^4) \; x_n^T \Sigma_\theta x_m,    (11)

where \hat\pi is chosen to be the posterior p(\theta | D_0, \sigma_0^2). See Appendix B.1 for details on this derivation. We can make a direct comparison with BALD [2, 4] by treating the squared norm of a data point with itself as a greedy acquisition function,2 \alpha_{ACS}(x_n; D_0) = \langle L_n, L_n \rangle_{\hat\pi, F}, yielding

    \alpha_{ACS}(x_n; D_0) = (x_n^T x_n / \sigma_0^4) \; x_n^T \Sigma_\theta x_n,
    \alpha_{BALD}(x_n; D_0) = (1/2) \log \left( 1 + x_n^T \Sigma_\theta x_n / \sigma_0^2 \right).    (12)

The two functions share the term x_n^T \Sigma_\theta x_n, but BALD wraps the term in a logarithm whereas \alpha_{ACS} scales it by x_n^T x_n. Ignoring the x_n^T x_n term in \alpha_{ACS} makes the two quantities proportional, i.e. \exp(2 \alpha_{BALD}(x_n; D_0)) \propto \alpha_{ACS}(x_n; D_0), and thus equivalent under a greedy maximizer. Another observation is that x_n^T \Sigma_\theta x_n is very similar to a leverage score [20-22], which is computed as x_n^T (X^T X)^{-1} x_n and quantifies the degree to which x_n influences the least-squares solution. We can then interpret the x_n^T x_n term in \alpha_{ACS} as allowing for more contribution from the current instance x_n than BALD or leverage scores would.

Probit regression. Consider the following model for Bayesian probit regression,

    p(y_n | x_n, \theta) = Ber(\Phi(\theta^T x_n)),   \theta \sim p(\theta),    (13)

where \Phi(\cdot) represents the standard Normal cumulative density function (cdf), and p(\theta) is assumed to be a factorized Gaussian with unit variance. We obtain a closed-form solution for Eq.
(8), i.e.

    \langle L_n, L_m \rangle_{\hat\pi, F} = x_n^T x_m \left( BvN(\zeta_n, \zeta_m, \rho_{n,m}) - \Phi(\zeta_n) \Phi(\zeta_m) \right),
    \zeta_i = \mu_\theta^T x_i / \sqrt{1 + x_i^T \Sigma_\theta x_i},
    \rho_{n,m} = x_n^T \Sigma_\theta x_m / \left( \sqrt{1 + x_n^T \Sigma_\theta x_n} \sqrt{1 + x_m^T \Sigma_\theta x_m} \right),    (14)

where BvN(\cdot) is the bi-variate Normal cdf. We again view \alpha_{ACS}(x_n; D_0) = \langle L_n, L_n \rangle_{\hat\pi, F} as an acquisition function and re-write Eq. (14) as

    \alpha_{ACS}(x_n; D_0) = x_n^T x_n \left( \Phi(\zeta_n)(1 - \Phi(\zeta_n)) - 2\,T\!\left( \zeta_n, 1 / \sqrt{1 + 2 x_n^T \Sigma_\theta x_n} \right) \right),    (15)

where T(\cdot, \cdot) is Owen's T function [23]. See Appendix B.2 for the full derivation of Eqs. (14) and (15). Eq. (15) has a simple and intuitive form that accounts for the magnitude of the input vector and a regularized term for the predictive variance.

2 We only introduce \alpha_{ACS} to compare to other acquisition functions; in practice we use Algorithm A.1.

5 Random projections for non-linear models

In Section 4, we have derived closed-form expressions of the weighted Fisher inner product for two specific types of models. However, this approach suffers from two shortcomings. First, it is limited to models for which the inner product can be evaluated in closed form, e.g. linear regression or probit regression. Second, the resulting algorithm requires O(|P|^2) computations to construct a batch, restricting our approach to moderately-sized pool sets.

We address both of these issues using random feature projections, allowing us to approximate the key quantities required for the batch construction. In Algorithm A.2, we introduce a procedure that works for any model with a tractable likelihood, scaling only linearly in the pool set size |P|. 
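As a concrete illustration of this idea (a sketch under our own toy assumptions, not Algorithm A.2 itself): given J posterior samples, each L_n is represented by a J-dimensional vector, and all pairwise inner products reduce to dot products. The toy log-likelihood L_n(\theta) = x_n \theta below is a hypothetical stand-in used only to check the estimator:

```python
import numpy as np

def projected_kernel(L_evals):
    """Monte Carlo estimate of the weighted Euclidean inner products.

    L_evals : (M, J) array with L_evals[n, j] = L_n(theta_j), where theta_1..theta_J
              are samples from the current posterior.
    Returns an (M, M) matrix approximating <L_n, L_m> = E[L_n(theta) L_m(theta)].
    """
    J = L_evals.shape[1]
    L_hat = L_evals / np.sqrt(J)      # projected vectors, stacked row-wise
    return L_hat @ L_hat.T            # inner products become dot products

# Toy check: with L_n(theta) = x_n * theta and theta ~ N(0, 1), the true inner
# product is E[L_n L_m] = x_n * x_m, which the estimate approaches for large J.
rng = np.random.default_rng(0)
theta = rng.standard_normal(10_000)               # stand-in "posterior" samples
x = np.array([1.0, -2.0, 0.5])
K_hat = projected_kernel(np.outer(x, theta))      # (3, 3) estimated kernel
```

In the batch construction, only products of the form in Eq. (7) against the residual are needed, so the full M x M matrix never has to be materialized; this is what gives the O(|P|J) cost stated in the text.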
To keep the exposition simple, we consider models in which the expectation of L_n(\theta) w.r.t. p(y_n|x_n, D_0) is tractable, but we stress that our algorithm could work with sampling for that expectation as well.

While it is easy to construct a projection for the weighted Fisher inner product [10], its dependence on the number of model parameters through the gradient makes it difficult to scale to more complex models. We therefore only consider projections for the weighted Euclidean inner product from Eq. (9), which we found to perform comparably in practice. The appropriate projection is [10]

    \hat{L}_n = (1/\sqrt{J}) \, [L_n(\theta_1), \ldots, L_n(\theta_J)]^T,   \theta_j \sim \hat\pi,    (16)

i.e. \hat{L}_n represents the J-dimensional projection of L_n in Euclidean space. Given this projection, we are able to approximate inner products as dot products between vectors,

    \langle L_n, L_m \rangle_{\hat\pi, 2} \approx \hat{L}_n^T \hat{L}_m,    (17)

where \hat{L}_n^T \hat{L}_m can be viewed as an unbiased sample estimator of \langle L_n, L_m \rangle_{\hat\pi, 2} using J Monte Carlo samples from the posterior \hat\pi. Importantly, Eq. (16) can be calculated for any model with a tractable likelihood. Since in practice we only require inner products of the form \langle L - L(w), L_n / \sigma_n \rangle_{\hat\pi, 2}, batches can be efficiently constructed in O(|P| J) time. As we show in Section 7, this enables us to scale our algorithm up to pool sets comprising hundreds of thousands of examples.

6 Related work

Bayesian AL approaches attempt to query points that maximally reduce model uncertainty. Common heuristics to this intractable problem greedily choose points where the predictive posterior is most uncertain, e.g. maximum variance and maximum entropy [3], or that maximally improve the expected information gain [2, 4]. 
Scaling these methods to the batch setting in a principled way is difficult for complex, non-linear models. Recent work on improving inference for AL with deep probabilistic models [24, 13] used datasets with at most 10 000 data points and few model updates.

Consequently, there has been great interest in batch AL recently. The literature is dominated by non-probabilistic methods, which commonly trade off diversity and uncertainty. Many approaches are model-specific, e.g. for linear regression [25], logistic regression [26, 27], and k-nearest neighbors [28]; our method works for any model with a tractable likelihood. Others [6-8] follow optimization-based approaches that require optimization over a large number of variables. As these methods scale quadratically with the number of data points, they are limited to smaller pool sets.

Probabilistic batch methods mostly focus on Bayesian optimization problems. Several approaches select the batch that jointly optimizes the acquisition function [29, 30]. As they scale poorly with the batch size, greedy batch construction algorithms are often used instead [31-34]. A common strategy is to impute the labels of the selected data points and update the model accordingly [33]. Our approach also uses the model to predict the labels, but importantly it does not require updating the model after every data point. Moreover, most of the methods in Bayesian optimization employ Gaussian process models. While AL with non-parametric models [35] could benefit from that work, scaling such models to large datasets remains challenging. Our work therefore provides the first principled, scalable and model-agnostic Bayesian batch AL approach.

Similar to us, Sener and Savarese [5] formulate AL as a core-set selection problem. They construct batches by solving a k-center problem, attempting to minimize the maximum distance to one of the k queried data points. 
Since this approach heavily relies on the geometry in data space, it requires an expressive feature representation. For example, Sener and Savarese [5] only consider ConvNet representations learned on highly structured image data. In contrast, our work is inspired by Bayesian coresets [9, 10], which enable scalable Bayesian inference by approximating the log-likelihood of a labeled dataset with a sparse weighted subset thereof. Consequently, our method is less reliant on a structured feature space and only requires evaluating log-likelihood terms.

Figure 2: Batches constructed by BALD (top, panels a-d) and ACS-FW (bottom, panels e-h) at steps t = 1, 2, 3, 10 on a probit regression task. 10 training data points (red, blue) were sampled from a standard bi-variate Normal, and labeled according to p(y|x) = Ber(\Phi(5 x_1 + 0 x_2)). At each step t, one unlabeled point (black cross) is queried from the pool set (colored according to acquisition function4; bright is higher). The current mean decision boundary of the model is shown as a black line. Best viewed in color.

7 Experiments and results

We perform experiments3 to answer the following questions: (1) does our approach avoid correlated queries, (2) is our method competitive with greedy methods in the small-data regime, and (3) does our method scale to large datasets and models? We address questions (1) and (2) on several linear and probit regression tasks using the closed-form solutions derived in Section 4, and question (3) on large-scale regression and classification datasets by leveraging the projections from Section 5. Finally, we provide a runtime evaluation for all regression experiments. Full experimental details are deferred to Appendix C.

Does our approach avoid correlated queries? In Fig. 1, we have seen that traditional AL methods are prone to correlated queries. To investigate this further, in Fig. 2 we compare batches selected by ACS-FW and BALD on a simple probit regression task. Since BALD has no explicit batch construction mechanism, we naively choose the b = 10 most informative points according to BALD. While the BALD acquisition function does not change during batch construction, \alpha_{ACS}(x_n; D_0) rotates after each selected data point. This provides further intuition about why ACS-FW is able to spread the batch in data space, avoiding the strongly correlated queries that BALD produces.

Is our method competitive with greedy methods in the small-data regime? We evaluate the performance of ACS-FW on several UCI regression datasets. We compare against (i) RANDOM: select points randomly; (ii) MAXENT: naively construct the batch using the top b points according to the maximum entropy criterion (equivalent to BALD in this case); (iii) MAXENT-SG: use MAXENT with a sequential greedy strategy (i.e. b = 1); (iv) MAXENT-I: sequentially acquire a single data point, impute the missing label and update the model accordingly. Starting with 20 labeled points sampled randomly from the pool set, we use each AL method to iteratively grow the training dataset by requesting batches of size b = 10 until the budget of 100 queries is exhausted. To guarantee fair comparisons, all methods use the same neural linear model, i.e. a Bayesian linear regression model with a deterministic neural network feature extractor [19]. In this setting, posterior inference can be

3 Source code is available at https://github.com/rpinsler/active-bayesian-coresets.
4 We use \alpha_{ACS} (see Eq. (15)) as an acquisition function for ACS-FW only for the sake of visualization.

Table 1: Final test RMSE on UCI regression datasets averaged over 40 (year: 5) seeds. 
MAXENT-I and MAXENT-SG require order(s) of magnitude more model updates and are thus not directly comparable.

            N        d   ACS-FW         RANDOM         MAXENT         MAXENT-I       MAXENT-SG
  yacht     308      6   1.031±0.0438   1.272±0.0593   0.923±0.0319   0.865±0.0276   0.971±0.0350
  boston    506      13  3.799±0.0858   4.068±0.0852   3.640±0.0652   3.467±0.0676   3.458±0.0682
  energy    768      8   0.855±0.0259   0.959±0.0337   1.443±0.0857   0.927±0.0461   1.055±0.0740
  power     9568     4   4.984±0.0366   5.108±0.0468   5.022±0.0428   4.834±0.0313   4.855±0.0339
  year      515 345  90  12.194±0.0596  13.165±0.0307  13.030±0.0975  N/A            N/A

Table 2: Runtime in seconds on UCI regression datasets averaged over 40 (year: 5) seeds. We report mean batch construction time (BT/it.) and total time (TT/it.) per AL iteration, as well as total cumulative time (total). MAXENT-I requires order(s) of magnitude more model updates and is thus not directly comparable.

            RANDOM                  MAXENT                    ACS-FW                    MAXENT-I
            BT/it.  TT/it.  total   BT/it.  TT/it.  total     BT/it.  TT/it.  total     BT/it.  TT/it.  total
  yacht     0.0     8.9     88.6    1.3     10.2    101.7     0.0     9.1     107.2     12.3    105.7   1057.4
  boston    0.0     12.4    123.6   2.4     14.5    144.8     0.1     12.4    132.7     23.5    157.9   1578.6
  energy    0.0     12.1    121.4   3.9     16.0    159.6     0.1     12.6    137.8     37.5    170.5   1704.9
  power     0.4     9.4     94.0    53.0    61.7    617.0     0.8     10.2    179.8     517.3   609.1   6090.7
  year      30.2    381.2   3811.6  3391.5  3746.5  37 464.6  53.0    463.8   28 475.2  N/A     N/A     N/A

done in closed form [19]. The model is re-trained for 1000 epochs after every AL iteration using Adam [36]. After each iteration, we evaluate RMSE on a held-out set. Experiments are repeated for 40 seeds, using randomized 80/20% train-test splits. 
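For reference, the closed-form posterior update used by such a neural linear model (the Bayesian linear regression of Eq. (10) applied to network features) can be sketched as follows. Feature extraction is omitted, and the function name, shapes and synthetic data are illustrative only:

```python
import numpy as np

def neural_linear_posterior(Phi, y, noise_var):
    """Closed-form posterior over the Bayesian last-layer weights, cf. Eq. (10).

    Phi : (N, d) matrix of features (e.g. activations of a trained network trunk).
    y   : (N,) regression targets.
    noise_var : observation-noise variance sigma_0^2; the prior on theta is N(0, I).
    Returns the posterior mean mu = (Phi^T Phi + sigma_0^2 I)^{-1} Phi^T y and
    covariance Sigma = sigma_0^2 (Phi^T Phi + sigma_0^2 I)^{-1}.
    """
    d = Phi.shape[1]
    A = Phi.T @ Phi + noise_var * np.eye(d)
    mu = np.linalg.solve(A, Phi.T @ y)
    Sigma = noise_var * np.linalg.inv(A)
    return mu, Sigma

# Sanity check on synthetic data: with plenty of data the posterior mean
# concentrates around the weights that generated the targets.
rng = np.random.default_rng(1)
theta_true = np.array([0.5, -1.0, 2.0])
Phi = rng.standard_normal((2000, 3))
y = Phi @ theta_true + 0.1 * rng.standard_normal(2000)
mu, Sigma = neural_linear_posterior(Phi, y, noise_var=0.01)
```

The same mu and Sigma are exactly the quantities entering the closed-form acquisition expressions of Eqs. (11)-(12).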
We also include a medium-scale experiment on power that follows the same protocol; however, for ACS-FW we use projections instead of the closed-form solutions, as they yield improved performance and are faster. Further details, including architectures and learning rates, are in Appendix C.

The results are summarized in Table 1. ACS-FW consistently outperforms RANDOM by a large margin (unlike MAXENT), and is mostly on par with MAXENT on the smaller datasets. While these results are encouraging, greedy methods such as MAXENT-SG and MAXENT-I still often yield better results in these small-data regimes. We conjecture that this is because, with so little data, even a single data point can have a significant impact on the posterior. The benefits of ACS-FW become clearer with increasing dataset size: as shown in Fig. 3, ACS-FW achieves much more data-efficient learning on larger datasets.

(a) yacht  (b) energy  (c) year

Figure 3: Test RMSE on UCI regression datasets averaged over 40 (a-b) and 5 (c) seeds during AL. Error bars denote two standard errors.

Does our method scale to large datasets and models? Leveraging the projections from Section 5, we apply ACS-FW to large-scale datasets and complex models. We demonstrate the benefits of our approach on year, a UCI regression dataset with ca. 515 000 data points, and on the classification datasets cifar10, SVHN and Fashion MNIST. Methods requiring model updates after every data point (e.g. MAXENT-SG, MAXENT-I) are impractical in these settings due to their excessive runtime. For year, we again use a neural linear model, start with 200 labeled points, and allow for batches of size b = 1000 until the budget of 10 000 queries is exhausted.
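For context, the closed-form inference available to the neural linear model amounts to standard Bayesian linear regression on the (fixed) network features. The following is a minimal sketch under common Gaussian assumptions; the function names and the isotropic prior are illustrative, not the paper's exact setup.

```python
import numpy as np

def neural_linear_posterior(phi, y, noise_var=0.1, prior_var=1.0):
    """Posterior N(mu, Sigma) over the last-layer weights of a neural linear
    model, given fixed features phi = f(x) of shape (N, D), targets y of
    shape (N,), Gaussian observation noise, and an isotropic Gaussian prior."""
    D = phi.shape[1]
    precision = phi.T @ phi / noise_var + np.eye(D) / prior_var
    Sigma = np.linalg.inv(precision)
    mu = Sigma @ phi.T @ y / noise_var
    return mu, Sigma

def predictive(mu, Sigma, phi_star, noise_var=0.1):
    """Predictive mean and variance at a new feature vector phi_star (D,)."""
    return phi_star @ mu, phi_star @ Sigma @ phi_star + noise_var
```

Because only this last layer is Bayesian, re-training the deterministic feature extractor and then updating (mu, Sigma) in closed form is cheap relative to full Bayesian inference over all network weights.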
We average the results over 5 seeds, using randomized 80/20% train-test splits. As can be seen in Fig. 3c, our approach significantly outperforms both RANDOM and MAXENT during the entire AL process.

For the classification experiments, we start with 1000 (cifar10: 5000) labeled points and request batches of size b = 3000 (5000), up to a budget of 12 000 (20 000) points. We compare to RANDOM, MAXENT and BALD, as well as two batch AL algorithms, namely K-MEDOIDS and K-CENTER [5]. Performance is measured in terms of accuracy on a holdout test set comprising 10 000 (SVHN: 26 032) points, as is standard, with the remainder used for training. We use a neural linear model with a ResNet18 [16] feature extractor, trained from scratch at every AL iteration for 250 epochs using Adam [36]. Since posterior inference is intractable in the multi-class setting, we resort to variational inference with mean-field Gaussian approximations [37, 38].

(a) cifar10  (b) SVHN  (c) Fashion MNIST

Figure 4: Test accuracy on classification tasks over 5 seeds. Error bars denote two standard errors.

Fig. 4 demonstrates that in all cases ACS-FW significantly outperforms RANDOM, which is a strong baseline in AL [5, 13, 24]. Somewhat surprisingly, we find that the probabilistic methods (BALD and MAXENT) provide strong baselines as well, and consistently outperform RANDOM. We discuss this point and provide further experimental results in Appendix D. Finally, Fig. 4 shows that in all cases ACS-FW performs at least as well as its competitors, including state-of-the-art non-probabilistic batch AL approaches such as K-CENTER.
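As a rough illustration of the mean-field Gaussian approximation used when closed-form inference is unavailable: each weight gets an independent Gaussian posterior factor whose parameters are learned by stochastic gradient ascent on the ELBO (Bayes by Backprop [38]). This numpy sketch is only illustrative; the actual experiments train such layers in PyTorch, and the class name and initialization below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class MeanFieldLinear:
    """Linear layer with a factorized Gaussian approximate posterior over its
    weights, q(w) = N(mu, diag(sigma^2)) -- a sketch of Bayes by Backprop."""
    def __init__(self, d_in, d_out, init_log_sigma=-3.0):
        self.mu = np.zeros((d_out, d_in))
        self.log_sigma = np.full((d_out, d_in), init_log_sigma)

    def forward(self, x):
        # Reparameterised weight sample: w = mu + sigma * eps, eps ~ N(0, I)
        eps = rng.standard_normal(self.mu.shape)
        w = self.mu + np.exp(self.log_sigma) * eps
        return x @ w.T

    def kl(self, prior_var=1.0):
        # Closed-form KL(q(w) || N(0, prior_var * I)), summed over all weights
        var = np.exp(2.0 * self.log_sigma)
        return 0.5 * np.sum(var / prior_var + self.mu ** 2 / prior_var
                            - 1.0 - 2.0 * self.log_sigma + np.log(prior_var))
```

Training maximizes the expected log-likelihood under samples from `forward` minus `kl`; this variational objective replaces the closed-form regression posterior when the likelihood is categorical.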
These results demonstrate that ACS-FW can usefully apply probabilistic reasoning to AL at scale, without any sacrifice in performance.

Runtime Evaluation  Runtime comparisons between different AL methods on the UCI regression datasets are shown in Table 2. For methods with fixed AL batch size b (RANDOM, MAXENT and MAXENT-I), the number of AL iterations is given by the total budget divided by b (e.g. 100/10 = 10 for yacht). Thus, the total cumulative time (total) is given by the total time per AL iteration (TT/it.) times the number of iterations. MAXENT-I iteratively constructs the batch by selecting a single data point, imputing its label, and updating the model; therefore the batch construction time (BT/it.) and the total time per AL iteration take roughly b times as long as for MAXENT (e.g. 10× for yacht). This approach becomes infeasible for very large batch sizes (e.g. 1000 for year). The same holds true for MAXENT-SG, which we have omitted here as its runtimes are similar to those of MAXENT-I. ACS-FW constructs batches of variable size, and hence its number of iterations varies.

As shown in Table 2, the batch construction times of ACS-FW are negligible compared to the total training times per AL iteration. Although ACS-FW requires more AL iterations than the other methods, its total cumulative runtimes are on par with those of MAXENT. Note that both MAXENT and MAXENT-I require computing the entropy of a Student's t-distribution, for which no batched version was available in PyTorch at the time we performed the experiments. Parallelizing this computation would likely further speed up batch construction.

8 Conclusion and future work

We have introduced a novel Bayesian batch AL approach based on sparse subset approximations. Our methodology yields intuitive closed-form solutions, revealing its connection to BALD as well as to leverage scores. More importantly, our approach admits relaxations (i.e.
random projections) that allow it to tackle challenging large-scale AL problems with general non-linear probabilistic models. Leveraging the Frank-Wolfe weights in a principled way and investigating how this method interacts with alternative approximate inference procedures are interesting avenues for future work.

Acknowledgments

Robert Pinsler receives funding from iCASE grant #1950384 with support from Nokia. Jonathan Gordon, Eric Nalisnick and José Miguel Hernández-Lobato were funded by Samsung Research, Samsung Electronics Co., Seoul, Republic of Korea. We thank Adrià Garriga-Alonso, James Requeima, Marton Havasi, Carl Edward Rasmussen and Trevor Campbell for helpful feedback and discussions.

References

[1] Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.

[2] David JC MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.

[3] Claude Elwood Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

[4] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.

[5] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.

[6] Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, and S Shankar Sastry. A convex optimization framework for active learning.
In IEEE International Conference on Computer Vision, pages 209–216, 2013.

[7] Yuhong Guo. Active instance sampling via matrix partition. In Advances in Neural Information Processing Systems, pages 802–810, 2010.

[8] Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G Hauptmann. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113–127, 2015.

[9] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.

[10] Trevor Campbell and Tamara Broderick. Automated scalable Bayesian inference via Hilbert coresets. The Journal of Machine Learning Research, 20(1):551–588, 2019.

[11] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

[12] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[13] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.

[14] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.

[15] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337–344, 2005.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[17] Oliver Johnson and Andrew Barron. Fisher information inequalities and the central limit theorem.
Probability Theory and Related Fields, 129(3):391\u2013409, 2004.\n\n[18] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel\n\nlearning. In Arti\ufb01cial Intelligence and Statistics, pages 370\u2013378, 2016.\n\n[19] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown. In\n\nInternational Conference on Learning Representations, 2018.\n\n[20] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast\napproximation of matrix coherence and statistical leverage. Journal of Machine Learning\nResearch, 13(Dec):3475\u20133506, 2012.\n\n[21] Ping Ma, Michael W Mahoney, and Bin Yu. A statistical perspective on algorithmic leveraging.\n\nThe Journal of Machine Learning Research, 16(1):861\u2013911, 2015.\n\n[22] Michal Derezinski, Manfred K Warmuth, and Daniel J Hsu. Leveraged volume sampling for\nlinear regression. In Advances in Neural Information Processing Systems, pages 2510\u20132519,\n2018.\n\n[23] Donald B Owen. Tables for computing bivariate normal probabilities. The Annals of Mathemat-\n\nical Statistics, 27(4):1075\u20131090, 1956.\n\n[24] Jos\u00e9 Miguel Hern\u00e1ndez-Lobato and Ryan Adams. Probabilistic backpropagation for scalable\nlearning of Bayesian neural networks. In International Conference on Machine Learning, pages\n1861\u20131869, 2015.\n\n[25] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In\n\nInternational Conference on Machine Learning, pages 1081\u20131088, 2006.\n\n[26] Steven CH Hoi, Rong Jin, Jianke Zhu, and Michael R Lyu. Batch mode active learning and its\napplication to medical image classi\ufb01cation. In International Conference on Machine Learning,\npages 417\u2013424, 2006.\n\n[27] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In Advances in\n\nNeural Information Processing Systems, pages 593\u2013600, 2008.\n\n[28] Kai Wei, Rishabh Iyer, and Jeff Bilmes. 
Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pages 1954–1963, 2015.

[29] Clément Chevalier and David Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, pages 59–69, 2013.

[30] Amar Shah and Zoubin Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3330–3338, 2015.

[31] Javad Azimi, Alan Fern, and Xiaoli Z Fern. Batch Bayesian optimization via simulation matching. In Advances in Neural Information Processing Systems, pages 109–117, 2010.

[32] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240, 2013.

[33] Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.

[34] Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch Bayesian optimization via local penalization. In Artificial Intelligence and Statistics, pages 648–657, 2016.

[35] Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Active learning with Gaussian processes for object categorization. In IEEE International Conference on Computer Vision, pages 1–8, 2007.

[36] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[37] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference.
Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.

[38] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.