{"title": "Statistical Active Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1295, "page_last": 1303, "abstract": "We describe a framework for designing efficient active learning algorithms that are tolerant to random classification noise. The framework is based on active learning algorithms that are statistical in the sense that they rely on estimates of expectations of functions of filtered random examples. It builds on the powerful statistical query framework of Kearns (1993). We show that any efficient active statistical learning algorithm can be automatically converted to an efficient active learning algorithm which is tolerant to random classification noise as well as other forms of uncorrelated\" noise. The complexity of the resulting algorithms has information-theoretically optimal quadratic dependence on $1/(1-2\\eta)$, where $\\eta$ is the noise rate. We demonstrate the power of our framework by showing that commonly studied concept classes including thresholds, rectangles, and linear separators can be efficiently actively learned in our framework. These results combined with our generic conversion lead to the first known computationally-efficient algorithms for actively learning some of these concept classes in the presence of random classification noise that provide exponential improvement in the dependence on the error $\\epsilon$ over their passive counterparts. In addition, we show that our algorithms can be automatically converted to efficient active differentially-private algorithms. 
This leads to the first differentially-private active learning algorithms with exponential label savings over the passive case.", "full_text": "Statistical Active Learning Algorithms\n\nMaria Florina Balcan\nGeorgia Institute of Technology\nninamf@cc.gatech.edu\n\nVitaly Feldman\nIBM Research - Almaden\nvitaly@post.harvard.edu\n\nAbstract\n\nWe describe a framework for designing efficient active learning algorithms that are tolerant to random classification noise and differentially-private. The framework is based on active learning algorithms that are statistical in the sense that they rely on estimates of expectations of functions of filtered random examples. It builds on the powerful statistical query framework of Kearns [30]. We show that any efficient active statistical learning algorithm can be automatically converted to an efficient active learning algorithm which is tolerant to random classification noise as well as other forms of “uncorrelated” noise. We show that commonly studied concept classes including thresholds, rectangles, and linear separators can be efficiently actively learned in our framework. These results combined with our generic conversion lead to the first computationally-efficient algorithms for actively learning some of these concept classes in the presence of random classification noise that provide exponential improvement in the dependence on the error ε over their passive counterparts. In addition, we show that our algorithms can be automatically converted to efficient active differentially-private algorithms. This leads to the first differentially-private active learning algorithms with exponential label savings over the passive case.\n\n1 Introduction\n\nMost classic machine learning methods depend on the assumption that humans can annotate all the data available for training. 
However, many modern machine learning applications have massive amounts of unannotated or unlabeled data. As a consequence, there has been tremendous interest, both in machine learning and in its application areas, in designing algorithms that most efficiently utilize the available data while minimizing the need for human intervention. An extensively used and studied technique is active learning, where the algorithm is presented with a large pool of unlabeled examples and can interactively ask for the labels of examples of its own choosing from the pool, with the goal of drastically reducing labeling effort. This has been a major area of machine learning research in the past decade [19], with several exciting developments on understanding its underlying statistical principles [27, 18, 4, 3, 29, 21, 15, 7, 31, 10, 34, 6]. In particular, several general characterizations have been developed for describing when active learning can in principle have an advantage over the classic passive supervised learning paradigm, and by how much. However, these efforts were primarily focused on sample size bounds rather than computation, and as a result many of the proposed algorithms are not computationally efficient. The situation is even worse in the presence of noise, where active learning appears to be particularly hard. In particular, prior to this work, there were no known efficient active algorithms for concept classes of super-constant VC-dimension that are provably robust to random and independent noise while giving improvements over the passive case.\nOur Results: We propose a framework for designing efficient (polynomial time) active learning algorithms which is based on restricting the way in which examples (both labeled and unlabeled) are accessed by the algorithm. These restricted algorithms can be easily simulated using active sampling and, in addition, possess a number of other useful properties. 
The main property we will consider is tolerance to random classification noise of rate η (each label is flipped randomly and independently with probability η [1]). Further, as we will show, the algorithms are tolerant to other forms of noise and can be simulated in a differentially-private way.\nIn our restriction, instead of access to random examples from some distribution P over X × Y, the learning algorithm only gets “active” estimates of the statistical properties of P in the following sense. The algorithm can choose any filter function χ : X → [0, 1] and any query function φ : X × Y → [−1, 1]. For simplicity we can think of χ as an indicator function of some set χS ⊆ X of “informative” points and of φ as some useful property of the target function. For this pair of functions the learning algorithm can get an estimate of E(x,y)∼P [φ(x, y) | x ∈ χS]. For τ and τ0 chosen by the algorithm, the estimate is provided to within tolerance τ as long as E(x,y)∼P [x ∈ χS] ≥ τ0 (nothing is guaranteed otherwise). Here the inverse of τ corresponds to the label complexity of the algorithm and the inverse of τ0 corresponds to its unlabeled sample complexity. Such a query is referred to as an active statistical query (SQ) and algorithms using active SQs are referred to as active statistical algorithms.\nOur framework builds on the classic statistical query (SQ) learning framework of Kearns [30] defined in the context of the PAC learning model [35]. The SQ model is based on estimates of expectations of functions of examples (but without the additional filter function) and was defined in order to design efficient noise-tolerant algorithms in the PAC model. Despite the restrictive form, most of the learning algorithms in the PAC model and other standard techniques in machine learning and statistics used for problems over distributions have SQ analogues [30, 12, 11, ?]1. Further, statistical algorithms enjoy additional properties: they can be simulated in a differentially-private way [11], automatically parallelized on multi-core architectures [17], and have known information-theoretic characterizations of query complexity [13, 26]. As we show, our framework inherits the strengths of the SQ model while, as we will argue, capturing the power of active learning.\nAt first glance, being active and statistical appear to be incompatible requirements on the algorithm. Active algorithms typically make label query decisions on the basis of examining individual samples (for example, as in binary search for learning a threshold or the algorithms in [27, 21, 22]). At the same time, statistical algorithms can only examine properties of the underlying distribution. But there also exist a number of active learning algorithms that can be seen as applying passive learning techniques to batches of examples that are obtained by querying labels of samples that satisfy the same filter. These include the general A2 algorithm [4] and, for example, the algorithms in [3, 20, 9, 8]. As we show, we can build on these techniques to provide algorithms that fit our framework.\nWe start by presenting a general reduction showing that any efficient active statistical learning algorithm can be automatically converted to an efficient active learning algorithm which is tolerant to random classification noise as well as other forms of “uncorrelated” noise. 
We then demonstrate the generality of our framework by showing that the most commonly studied concept classes, including thresholds, balanced rectangles, and homogeneous linear separators, can be efficiently actively learned via active statistical algorithms. For these concept classes, we design efficient active learning algorithms that are statistical and provide the same exponential improvements in the dependence on the error ε over passive learning as their non-statistical counterparts.\nThe primary problem we consider is active learning of homogeneous halfspaces, a problem that has attracted a lot of interest in the theory of active learning [27, 18, 3, 9, 22, 16, 23, 8, 28]. We describe two algorithms for the problem. First, building on insights from margin-based analysis [3, 8], we give an active statistical learning algorithm for homogeneous halfspaces over all isotropic log-concave distributions, a wide class of distributions that includes many well-studied density functions and has played an important role in several areas including sampling, optimization, and learning [32]. Our algorithm for this setting proceeds in rounds; in round t we build a better approximation wt to the target function by using a passive SQ learning algorithm (e.g., the one of [24]) over a distribution Dt that is a mixture of distributions in which each component is the original distribution conditioned on being within a certain distance from the hyperplane defined by previous approximations wi. To perform passive statistical queries relative to Dt we use active SQs with a corresponding real-valued filter. This algorithm is computationally efficient and uses only poly(d, log(1/ε)) active statistical queries of tolerance inverse-polynomial in the dimension d and log(1/ε).\n\n1The sample complexity of the SQ analogues might increase sometimes though.\n\nFor the special case of the uniform distribution over the unit ball we give a new, simpler and substantially more efficient active statistical learning algorithm. Our algorithm is based on measuring the error of a halfspace conditioned on being within some margin of that halfspace. We show that such measurements performed on the perturbations of the current hypothesis along the d basis vectors can be combined to derive a better hypothesis. This approach differs substantially from the previous algorithms for this problem [3, 22]. The algorithm is computationally efficient and uses d log(1/ε) active SQs with tolerance of Ω(1/√d) and filter tolerance of Ω(ε).\nThese results, combined with our generic simulation of active statistical algorithms in the presence of random classification noise (RCN), lead to the first known computationally efficient algorithms for actively learning halfspaces which are RCN tolerant and give provable label savings over the passive case. For the uniform distribution case this leads to an algorithm with sample complexity of O((1 − 2η)^−2 · d^2 log(1/ε) log(d log(1/ε))) and for the general isotropic log-concave case we get sample complexity of poly(d, log(1/ε), 1/(1 − 2η)). This is worse than the sample complexity in the noiseless case, which is just O((d + log log(1/ε)) log(1/ε)) [8]. However, compared to passive learning in the presence of RCN, our algorithms have exponentially better dependence on ε and essentially the same dependence on d and 1/(1 − 2η). 
One issue with the generic simulation is that it requires knowledge of η (or an almost precise estimate). The standard approach to dealing with this issue does not always work in the active setting, and for our log-concave and uniform distribution algorithms we give a specialized argument that preserves the exponential improvement in the dependence on ε.\n\nDifferentially-private active learning: In many applications of machine learning, such as medical and financial record analysis, data is both sensitive and expensive to label. However, to the best of our knowledge, there are no formal results addressing both of these constraints. We address the problem by defining a natural model of differentially-private active learning. In our model we assume that a learner has full access to the unlabeled portion of some database of n examples S ⊆ X × Y which correspond to records of the individual participants in the database. In addition, for every element of the database S the learner can request the label of that element. As usual, the goal is to minimize the number of label requests (such a setup is referred to as pool-based active learning [33]). In addition, we would like to preserve the differential privacy of the participants in the database, a now-standard notion of privacy introduced in [25]. Informally speaking, an algorithm is differentially private if adding any record to S (or removing a record from S) does not significantly affect the probability that any specific hypothesis will be output by the algorithm.\nAs first shown by [11], SQ algorithms can be automatically translated into differentially-private algorithms. Using a similar approach, we show that active SQ learning algorithms can be automatically transformed into differentially-private active learning algorithms. 
Using our active statistical algorithms for halfspaces we obtain the first algorithms that are both differentially-private and give exponential improvements in the dependence of label complexity on the accuracy parameter ε.\n\nAdditional related work: As we have mentioned, most prior theoretical work on active learning focuses on either sample complexity bounds (without regard for efficiency) or the noiseless case. For random classification noise in particular, [6] provides a sample complexity analysis based on the splitting index that is optimal up to polylog factors and works for general concept classes and distributions, but it is not computationally efficient. In addition, several works give active learning algorithms with empirical evidence of robustness to certain types of noise [9, 28].\nIn [16, 23], online learning algorithms in the selective sampling framework are presented, where labels must be actively queried before they are revealed. Under the assumption that the label conditional distribution is a linear function determined by a fixed target vector, they provide bounds on the regret of the algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. As pointed out in [23], these results can also be converted to a distributional PAC setting where the instances xt are drawn i.i.d. In this setting they obtain exponential improvement in label complexity over passive learning. These interesting results and techniques are not directly comparable to ours. Our framework is not restricted to halfspaces. Another important difference is that (as pointed out in [28]) the exponential improvement they give is not possible in the noiseless version of their setting. In other words, the addition of linear noise defined by the target makes the problem easier for active sampling. 
By contrast, RCN can only make the classification task harder than in the realizable case.\n\nDue to space constraints, details of most proofs and further discussion appear in the full version of this paper [5].\n\n2 Active Statistical Algorithms\n\nLet X be a domain and P be a distribution over labeled examples on X. We represent such a distribution by a pair (D, ψ) where D is the marginal distribution of P on X and ψ : X → [−1, 1] is a function defined as ψ(z) = E(x,ℓ)∼P [ℓ | x = z]. We will be primarily considering learning in the PAC model (realizable case) where ψ is a Boolean function, possibly corrupted by random noise.\nWhen learning with respect to a distribution P = (D, ψ), an active statistical learner has access to active statistical queries. A query of this type is a pair of functions (χ, φ), where χ : X → [0, 1] is the filter function which, for a point x, specifies the probability with which the label of x should be queried. The function φ : X × {−1, 1} → [−1, 1] is the query function and depends on both the point and the label. The filter function χ defines the distribution D conditioned on χ as follows: for each x the density function D|χ(x) is defined as D|χ(x) = D(x)χ(x)/ED[χ(x)]. Note that if χ is an indicator function of some set S then D|χ is exactly D conditioned on x being in S. Let P|χ denote the conditioned distribution (D|χ, ψ). In addition, a query has two tolerance parameters: filter tolerance τ0 and query tolerance τ. In response to such a query the algorithm obtains a value μ such that if ED[χ(x)] ≥ τ0 then\n\n|μ − EP|χ[φ(x, ℓ)]| ≤ τ\n\n(and nothing is guaranteed when ED[χ(x)] < τ0).\nAn active statistical learning algorithm can also ask target-independent queries with tolerance τ, which are just queries over unlabeled samples. That is, for a query ϕ : X → [−1, 1] the algorithm obtains a value μ such that |μ − ED[ϕ(x)]| ≤ τ. Such queries are not necessary when D is known to the learner. Also, for the purposes of obtaining noise-tolerant algorithms, one can relax the requirements of the model and give the learning algorithm access to unlabeled samples.\nOur definition generalizes the statistical query framework of Kearns [30], which does not include a filter function; in other words, a query is just a function φ : X × {−1, 1} → [−1, 1] and it has a single tolerance parameter τ. By definition, an active SQ (χ, φ) with tolerance τ relative to P is the same as a passive statistical query φ with tolerance τ relative to the distribution P|χ. In particular, a (passive) SQ is equivalent to an active SQ with filter χ ≡ 1 and filter tolerance 1.\nWe note that from the definition of an active SQ we can see that\n\nEP|χ[φ(x, ℓ)] = EP [φ(x, ℓ) · χ(x)]/EP [χ(x)].\n\nThis implies that an active statistical query can be estimated using two passive statistical queries. However, to estimate EP|χ[φ(x, ℓ)] with tolerance τ one needs to estimate EP [φ(x, ℓ) · χ(x)] with tolerance τ · EP [χ(x)], which can be much lower than τ. 
The tolerance of an SQ directly corresponds to the number of examples needed to evaluate it, and therefore simulating active SQs passively might require many more labeled examples.\n\n2.1 Simulating Active Statistical Queries\n\nWe first note that a valid response to a target-independent query with tolerance τ can be obtained, with probability at least 1 − δ, using O(τ^−2 log(1/δ)) unlabeled samples.\nA natural way of simulating an active SQ is by filtering points drawn randomly from D: draw a random point x, let B be drawn from the Bernoulli distribution with probability of success χ(x), and ask for the label of x when B = 1. The points for which we ask for a label are distributed according to D|χ. This implies that the empirical average of φ(x, ℓ) on O(τ^−2 log(1/δ)) labeled examples will then give μ. Formally, we get the following theorem.\nTheorem 2.1. Let P = (D, ψ) be a distribution over X × {−1, 1}. There exists an active sampling algorithm that, given functions χ : X → [0, 1], φ : X × {−1, 1} → [−1, 1], values τ0 > 0, τ > 0, δ > 0, and access to samples from P, with probability at least 1 − δ, outputs a valid response to the active statistical query (χ, φ) with tolerance parameters (τ0, τ). The algorithm uses O(τ^−2 log(1/δ)) labeled examples from P and O(τ0^−1 τ^−2 log(1/δ)) unlabeled samples from D.\n\nA direct way to simulate all the queries of an active SQ algorithm is to estimate the response to each query using fresh samples and use the union bound to ensure that, with probability at least 1 − δ, all queries are answered correctly. 
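In Python, the rejection-filtering simulation behind Theorem 2.1 can be sketched as follows. The function names and the explicit Hoeffding constant are our own illustration, not code from the paper; in an actual active learner the label would be requested only after a point passes the filter, so label cost is paid only for accepted points.

```python
import math
import random

def simulate_active_sq(sample_labeled_example, chi, phi, tau, delta):
    """Estimate E_{P|chi}[phi(x, l)] by rejection filtering (sketch of Theorem 2.1).

    sample_labeled_example() draws a labeled example (x, l) from P.
    """
    # Hoeffding bound for phi taking values in [-1, 1]:
    # O(tau^-2 log(1/delta)) labeled examples suffice.
    n = math.ceil(2.0 * math.log(2.0 / delta) / tau ** 2)
    total, count = 0.0, 0
    while count < n:
        x, label = sample_labeled_example()
        # keep x with probability chi(x); accepted points follow D|chi
        if random.random() < chi(x):
            total += phi(x, label)
            count += 1
    return total / count
```

The expected number of unlabeled draws per accepted example is 1/ED[chi(x)], which is how the τ0^−1 factor in the unlabeled sample complexity arises.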
Such direct simulation of an algorithm that uses at most q queries can be done using O(q τ^−2 log(q/δ)) labeled examples and O(q τ0^−1 τ^−2 log(q/δ)) unlabeled samples. However, in many cases a more careful analysis can be used to reduce the sample complexity of the simulation.\nLabeled examples can be shared to simulate queries that use the same filter χ and do not depend on each other. This implies that the sample size sufficient for simulating q non-adaptive queries with the same filter scales logarithmically with q. More generally, given a set of q query functions (possibly chosen adaptively) which belong to some set Q of low complexity (such as VC dimension), one can reduce the sample complexity of estimating the answers to all q queries (with the same filter) by invoking the standard bounds based on uniform convergence (e.g. [14]).\n\n2.2 Noise tolerance\n\nAn important property of the simulation described in Theorem 2.1 is that it can be easily adapted to the case when the labels are corrupted by random classification noise [1]. For a distribution P = (D, ψ) let P^η denote the distribution P with the label flipped with probability η randomly and independently of the example. It is easy to see that P^η = (D, (1 − 2η)ψ). We now show that, as in the SQ model [30], active statistical queries can be simulated given examples from P^η.\nTheorem 2.2. Let P = (D, ψ) be a distribution over examples and let η ∈ [0, 1/2) be a noise rate. There exists an active sampling algorithm that, given functions χ : X → [0, 1], φ : X × {−1, 1} → [−1, 1], values η, τ0 > 0, τ > 0, δ > 0, and access to samples from P^η, with probability at least 1 − δ, outputs a valid response to the active statistical query (χ, φ) with tolerance parameters (τ0, τ). The algorithm uses O(τ^−2 (1 − 2η)^−2 log(1/δ)) labeled examples from P^η and O(τ0^−1 τ^−2 (1 − 2η)^−2 log(1/δ)) unlabeled samples from D.\nNote that the sample complexity of the resulting active sampling algorithm has information-theoretically optimal quadratic dependence on 1/(1 − 2η), where η is the noise rate.\nRemark 2.3. This simulation assumes that η is given to the algorithm exactly. It is easy to see from the proof that any value η′ such that (1 − 2η)/(1 − 2η′) ∈ [1 − τ/4, 1 + τ/4] can be used in place of η (with the tolerance of estimating EP^η|χ[(φ(x, 1) − φ(x, −1))/2 · ℓ] set to (1 − 2η)τ/4). In some learning scenarios even an approximate value of η is not known, but it is known that η ≤ η0 < 1/2. To address this issue one can construct a sequence η1, . . . , ηk of guesses of η, run the learning algorithm with each of those guesses in place of the true η, and let h1, . . . , hk be the resulting hypotheses [30]. One can then return the hypothesis hi among those that has the best agreement with a suitably large sample. It is not hard to see that k = O(τ^−1 · log(1/(1 − 2η0))) guesses will suffice for this strategy to work [2].\nPassive hypothesis testing requires Ω(1/ε) labeled examples and might be too expensive to be used with active learning algorithms. It is unclear if there exists a general approach for dealing with unknown η in the active learning setting that does not substantially increase the labeled example complexity. However, as we will demonstrate, in the context of specific active learning algorithms variants of this approach can be used to solve the problem.\n\nWe now show that more general types of noise can be tolerated as long as they are “uncorrelated” with the queries and the target function. Namely, we represent label noise using a function Λ : X → [0, 1], where Λ(x) gives the probability that the label of x is flipped. The rate of Λ when learning with respect to a marginal distribution D over X is ED[Λ(x)]. For a distribution P = (D, ψ) over examples, we denote by P^Λ the distribution P corrupted by label noise Λ. It is easy to see that P^Λ = (D, ψ · (1 − 2Λ)). Intuitively, Λ is “uncorrelated” with a query if the way that Λ deviates from its rate is almost orthogonal to the query on the target distribution.\nDefinition 2.4. Let P = (D, ψ) be a distribution over examples and τ′ > 0. For functions χ : X → [0, 1], φ : X × {−1, 1} → [−1, 1], we say that a noise function Λ : X → [0, 1] is (η, τ′)-uncorrelated with φ and χ over P if\n\n|ED|χ[(φ(x, 1) − φ(x, −1))/2 · ψ(x) · (1 − 2(Λ(x) − η))]| ≤ τ′.\n\nIn this definition (1 − 2(Λ(x) − η)) is the expectation of a {−1, 1} coin that is flipped with probability Λ(x) − η, whereas (φ(x, 1) − φ(x, −1))ψ(x) is the part of the query which measures the correlation with the label. 
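The (1 − 2η) correction behind Theorem 2.2 can be illustrated with a minimal sketch. The names are hypothetical and the samples are assumed to be already drawn from the filtered noisy distribution; the key fact is that φ(x, ℓ) splits into a label-free part (φ(x, 1) + φ(x, −1))/2 and a label-dependent part ℓ · (φ(x, 1) − φ(x, −1))/2, and random flips shrink only the latter by a factor (1 − 2η).

```python
def noise_corrected_sq(samples, phi, eta):
    """Recover an estimate of E_{P|chi}[phi(x, l)] from examples whose labels
    were flipped independently with known probability eta (sketch of the
    correction used in Theorem 2.2; not code from the paper).

    samples: list of (x, noisy_label) pairs drawn from the filtered
    noisy distribution.
    """
    label_free = 0.0  # part of phi that ignores the label
    label_dep = 0.0   # correlation of phi with the (noisy) label
    for x, l in samples:
        label_free += (phi(x, 1) + phi(x, -1)) / 2.0
        label_dep += (phi(x, 1) - phi(x, -1)) / 2.0 * l
    n = len(samples)
    # random classification noise shrinks the label-dependent part by (1 - 2*eta)
    return label_free / n + (label_dep / n) / (1.0 - 2.0 * eta)
```

Dividing by (1 − 2η) rescales the estimation error as well, which is why the label complexity picks up the (1 − 2η)^−2 factor in Theorem 2.2.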
We now give an analogue of Theorem 2.2 for this more general setting.\nTheorem 2.5. Let P = (D, ψ) be a distribution over examples, let χ : X → [0, 1] and φ : X × {−1, 1} → [−1, 1] be a filter and a query function, respectively, let η ∈ [0, 1/2), τ > 0, and let Λ be a noise function that is (η, (1 − 2η)τ/4)-uncorrelated with φ and χ over P. There exists an active sampling algorithm that, given functions χ and φ, values η, τ0 > 0, τ > 0, δ > 0, and access to samples from P^Λ, with probability at least 1 − δ, outputs a valid response to the active statistical query (χ, φ) with tolerance parameters (τ0, τ). The algorithm uses O(τ^−2 (1 − 2η)^−2 log(1/δ)) labeled examples from P^Λ and O(τ0^−1 τ^−2 (1 − 2η)^−2 log(1/δ)) unlabeled samples from D.\n\nAn immediate implication of Theorem 2.5 is that one can simulate an active SQ algorithm A using examples corrupted by noise Λ as long as Λ is (η, (1 − 2η)τ/4)-uncorrelated with all of A's queries of tolerance τ for some fixed η.\n\n2.3 Simple examples\n\nThresholds: We show that a classic example of active learning, learning a threshold function on an interval, can be easily expressed using active SQs. For simplicity and without loss of generality we can assume that the interval is [0, 1] and the distribution is uniform over it (as usual, we can bring the distribution to be close enough to this form using unlabeled samples or target-independent queries). Assume that we know that the threshold θ belongs to the interval [a, b] ⊆ [0, 1]. We ask a query φ(x, ℓ) = (ℓ + 1)/2 with filter χ(x) which is the indicator function of the interval [a, b], with tolerance 1/4 and filter tolerance b − a. Let v be the response to the query. By definition, E[χ(x)] = b − a and therefore we have that |v − E[φ(x, ℓ) | x ∈ [a, b]]| ≤ 1/4. Note that\n\nE[φ(x, ℓ) | x ∈ [a, b]] = (b − θ)/(b − a).\n\nWe can therefore conclude that (b − θ)/(b − a) ∈ [v − 1/4, v + 1/4], which means that θ ∈ [b − (v + 1/4)(b − a), b − (v − 1/4)(b − a)] ∩ [a, b]. Note that the length of this interval is at most (b − a)/2. This means that after at most log2(1/ε) + 1 iterations we will reach an interval [a, b] of length at most ε. In each iteration only constant tolerance 1/4 is necessary and the filter tolerance is never below ε. A direct simulation of this algorithm can be done using log(1/ε) · log(log(1/ε)/δ) labeled examples and Õ(1/ε) · log(1/δ) unlabeled samples.\nLearning of thresholds can also be easily used to obtain a simple algorithm for learning axis-aligned rectangles whose weight under the target distribution is not too small.\n\nA2: We now note that the general and well-studied A2 algorithm of [4] falls naturally into our framework. At a high level, the A2 algorithm is an iterative, disagreement-based active learning algorithm. It maintains a set of surviving classifiers Ci ⊆ C, and in each round the algorithm asks for the labels of a few random points that fall in the current region of disagreement of the surviving classifiers. Formally, the region of disagreement DIS(Ci) of a set of classifiers Ci is the set of instances x for which there exist two classifiers f, g ∈ Ci that disagree about the label of x. Based on the queried labels, the algorithm then eliminates hypotheses that were still under consideration, but only if it is statistically confident (given the labels queried in the last round) that they are suboptimal. 
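Returning to the thresholds example, the interval-halving learner can be sketched as follows. The `active_sq` oracle interface is a hypothetical stand-in for the simulation of Section 2.1 (answering the query to tolerance τ as long as the filter has mass at least τ0); this is an illustration, not code from the paper.

```python
import math

def learn_threshold(active_sq, eps):
    """Interval-halving learner for a threshold on [0, 1] under the uniform
    distribution, phrased with active SQs (sketch)."""
    a, b = 0.0, 1.0
    for _ in range(int(math.log2(1.0 / eps)) + 1):
        chi = lambda x, a=a, b=b: 1.0 if a <= x <= b else 0.0  # filter for [a, b]
        phi = lambda x, l: (l + 1) / 2.0  # fraction of positive labels
        v = active_sq(chi, phi, tau=0.25, tau0=b - a)
        # (b - theta)/(b - a) lies in [v - 1/4, v + 1/4]; intersect with [a, b]
        lo = b - (v + 0.25) * (b - a)
        hi = b - (v - 0.25) * (b - a)
        a, b = max(a, lo), min(b, hi)
    return (a + b) / 2.0
```

Each round halves the candidate interval while asking only a constant-tolerance query, which is where the exponential label savings over passive learning come from.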
In essence, in each round A2 only needs to estimate the error rates\n(of hypotheses still under consideration) under the conditional distribution of being in the region of\ndisagreement. This can be easily done via active statistical queries. Note that while the number of\nactive statistical queries needed to do this could be large, the number of labeled examples needed\nto simulate these queries is essentially the same as the number of labeled examples needed by the\nknown A2 analyses [29]. While in general the required computation of the disagreement region\nand manipulations of the hypothesis space cannot be done ef\ufb01ciently, ef\ufb01cient implementation is\npossible in a number of simple cases such as when the VC dimension of the concept class is a\nconstant.\nIt is not hard to see that in these cases the implementation can also be done using a\nstatistical algorithm.\n\n6\n\n\f3 Learning of halfspaces\n\nIn this section we outline our reduction from active learning to passive learning of homogeneous\nlinear separators based on the analysis of Balcan and Long [8]. Combining it with the SQ learning\nalgorithm for halfspaces by Dunagan and Vempala [24], we obtain the \ufb01rst ef\ufb01cient noise-tolerant\nactive learning of homogeneous halfspaces for any isotropic log-concave distribution. One of the\nkey point of this result is that it is relatively easy to harness the involved results developed for SQ\nframework to obtain new active statistical algorithms.\nLet Hd denote the concept class of all homogeneous halfspaces. Recall that a distribution over Rd\nis log-concave if log f (\u00b7) is concave, where f is its associated density function. It is isotropic if its\nmean is the origin and its covariance matrix is the identity. Log-concave distributions form a broad\nclass of distributions: for example, the Gaussian, Logistic, Exponential, and uniform distribution\nover any convex set are log-concave distributions. 
Using results in [24] and properties of log-concave
distributions, we can show:
Theorem 3.1. There exists a SQ algorithm LearnHS that learns Hd to accuracy 1 − ε over any
distribution D|χ, where D is an isotropic log-concave distribution and χ : Rd → [0, 1] is a filter
function. Further, LearnHS outputs a homogeneous halfspace, runs in time polynomial in d, 1/ε
and log(1/λ), and uses SQs of tolerance ≥ 1/poly(d, 1/ε, log(1/λ)), where λ = E_D[χ(x)].
We now state the properties of our new algorithm formally.
Theorem 3.2. There exists an active SQ algorithm ActiveLearnHS-LogC (Algorithm 1) that
for any isotropic log-concave distribution D on Rd, learns Hd over D to accuracy 1 − ε in time
poly(d, log(1/ε)), using active SQs of tolerance ≥ 1/poly(d, log(1/ε)) and filter tolerance Ω(ε).

Algorithm 1 ActiveLearnHS-LogC: Active SQ learning of homogeneous halfspaces over
isotropic log-concave densities
1: %% Constants c, C1, C2 and C3 are determined by the analysis.
2: Run LearnHS with error C2 to obtain w_0.
3: for k = 1 to s = ⌈log2(1/(cε))⌉ do
4:   Let b_{k−1} = C1/2^{k−1}
5:   Let µ_k equal the indicator function of being within margin b_{k−1} of w_{k−1}
6:   Let χ_k = (Σ_{i≤k} µ_i)/k
7:   Run LearnHS over D_k = D|χ_k with error C2/k by using active queries with filter χ_k and filter tolerance C3·ε to obtain w_k
8: end for
9: return w_s

We remark that, as usual, we can first bring the distribution to an isotropic position by using target-independent queries to estimate the mean and the covariance matrix of the distribution.
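This preprocessing step amounts to whitening. A minimal NumPy sketch (ours, not the paper's implementation, which would estimate the moments with target-independent statistical queries rather than from raw samples):

```python
import numpy as np

def whiten(X):
    """Bring a sample to (approximately) isotropic position: subtract the
    empirical mean and rescale by the inverse square root of the empirical
    covariance. Illustrative stand-in for the target-independent
    preprocessing step described in the text."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    # inverse square root of the covariance via its eigendecomposition
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return (X - mu) @ W
```

After this transform the empirical mean is the origin and the empirical covariance is the identity, matching the definition of isotropic position above.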
Therefore
our algorithm can be used to learn halfspaces over general log-concave densities as long as the target
halfspace passes through the mean of the density.
We can now apply Theorem 2.2 (or, more generally, Theorem 2.5) to obtain an efficient active learning
algorithm for homogeneous halfspaces over log-concave densities in the presence of random
classification noise of known rate. Further, since our algorithm relies on LearnHS, which can also
be simulated when the noise rate is unknown (see Remark 2.3), we obtain an active algorithm that
does not require knowledge of the noise rate.
Corollary 3.3. There exists a polynomial-time active learning algorithm that for any η ∈ [0, 1/2),
learns Hd over any log-concave distribution with random classification noise of rate η to error ε
using poly(d, log(1/ε), 1/(1 − 2η)) labeled examples and a polynomial number of unlabeled samples.

For the special case of the uniform distribution on the unit sphere (or, equivalently for our purposes,
the unit ball) we give a substantially simpler and more efficient algorithm in terms of both sample and
computational complexity. This setting was previously studied in [3, 22]. The detailed presentation
of the technical ideas appears in the full version of the paper [5].

Theorem 3.4. There exists an active SQ algorithm ActiveLearnHS-U that learns Hd over
the uniform distribution on the (d − 1)-dimensional unit sphere to accuracy 1 − ε, uses
(d + 1) log(1/ε) active SQs with tolerance Ω(1/√d) and filter tolerance Ω(ε), and runs in time
d · poly(log(d/ε)).

4 Differentially-private active learning

In this section we show that active SQ learning algorithms can also be used to obtain differentially-private active learning algorithms. Formally, for some domain X × Y, we will call S ⊆ X × Y a
database.
Databases S, S′ ⊆ X × Y are adjacent if one can be obtained from the other by modifying
a single element. Here we will always have Y = {−1, 1}. In the following, A is an algorithm that
takes as input a database S and outputs an element of some finite set R.
Definition 4.1 (Differential privacy [25]). A (randomized) algorithm A : 2^(X×Y) → R is α-differentially
private if for all r ∈ R and every pair of adjacent databases S, S′, we have
Pr[A(S) = r] ≤ e^α · Pr[A(S′) = r].

Here we consider algorithms that operate on S in an active way. That is, the learning algorithm
receives the unlabeled part of each point in S as input and can only obtain the label of a point
upon request. The total number of requests is the label complexity of the algorithm.
Theorem 4.2. Let A be an algorithm that learns a class of functions H to accuracy 1 − ε over
distribution D using M1 active SQs of tolerance τ and filter tolerance τ0, and M2 target-independent
queries of tolerance τu. There exists a learning algorithm A′ that, given α > 0, δ > 0 and active
access to a database S ⊆ X × {−1, 1}, is α-differentially private and uses at most
O([M1/(ατ) + M1/τ²] · log(M1/δ)) labels. Further, for some
n = O([M1/(ατ0τ) + M1/(τ0τ²) + M2/(ατu) + M2/τu²] · log((M1 + M2)/δ)),
if S consists of at least n examples drawn randomly from D then, with probability at least 1 − δ, A′
outputs a hypothesis with accuracy ≥ 1 − ε (relative to distribution D). The running time of A′ is
the same as the running time of A plus O(n).

An immediate consequence of Theorem 4.2 is that for learning of homogeneous halfspaces over
uniform or log-concave distributions we can obtain differential privacy while essentially preserving
the label complexity.
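The conversion behind Theorem 4.2 rests on a standard idea: each statistical query is answered via the Laplace mechanism, with noise calibrated to the query's sensitivity. A minimal sketch for a single query (our illustration; the function name is an assumption, and the handling of the tolerance τ, filters, and composition across queries is omitted):

```python
import math
import random

def private_sq(phi, examples, alpha):
    """Answer one statistical query E[phi] over a database with
    alpha-differential privacy via the Laplace mechanism.

    phi maps an example to [0, 1], so changing one of the n database
    elements moves the empirical average by at most 1/n (its sensitivity);
    Laplace noise of scale 1/(alpha * n) then yields alpha-differential
    privacy for this single answer. Illustrative sketch only."""
    n = len(examples)
    avg = sum(phi(z) for z in examples) / n
    b = 1.0 / (alpha * n)        # Laplace scale = sensitivity / alpha
    u = random.random() - 0.5    # inverse-CDF sample of Laplace(0, b)
    noise = -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return avg + noise
```

Answering many queries requires splitting the privacy budget across them, which is reflected in the 1/α factors in the bounds of Theorem 4.2.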
For example, by combining Theorems 4.2 and 3.4, we can efficiently and
differentially-privately learn homogeneous halfspaces under the uniform distribution with privacy
parameter α and error parameter ε by using only O((d√d · log(1/ε))/α + d² · log(1/ε)) labels. However,
it is known that any passive learning algorithm, even ignoring privacy considerations and noise,
requires Ω((d/ε) · log(1/ε + 1/δ)) labeled examples. So for α ≥ 1/√d and small enough ε we get better
label complexity.

5 Discussion

Our work suggests that, as in passive learning, active statistical algorithms might be essentially
as powerful as example-based efficient active learning algorithms. It would be interesting to find
more general evidence supporting this claim or, alternatively, a counterexample. A nice aspect of
(passive) statistical learning algorithms is that it is possible to prove unconditional lower bounds on
such algorithms using SQ dimension [13] and its extensions. It would be interesting to develop an
active analogue of these techniques and give meaningful lower bounds based on them.

Acknowledgments We thank Avrim Blum and Santosh Vempala for useful discussions. This work
was supported in part by NSF grants CCF-0953192 and CCF-1101215, AFOSR grant FA9550-09-1-0538, ONR grant N00014-09-1-0751, and a Microsoft Research Faculty Fellowship.

References
[1] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2:343–370, 1988.
[2] J. Aslam and S. Decatur. Specification and simulation of statistical query algorithms for efficiency and noise tolerance. JCSS, 56:191–208, 1998.
[3] M. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, pages 35–50, 2007.
[4] M. F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
[5] M. F.
Balcan and V. Feldman. Statistical active learning algorithms, 2013. ArXiv:1307.3102.
[6] M.-F. Balcan and S. Hanneke. Robust interactive learning. In COLT, 2012.
[7] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In COLT, 2008.
[8] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. JMLR - COLT proceedings (to appear), 2013.
[9] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, pages 49–56, 2009.
[10] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.
[11] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Proceedings of PODS, pages 128–138, 2005.
[12] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
[13] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In STOC, pages 253–262, 1994.
[14] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
[15] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.
[16] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 2010.
[17] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Proceedings of NIPS, pages 281–288, 2006.
[18] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, volume 18, 2005.
[19] S. Dasgupta. Active learning. Encyclopedia of Machine Learning, 2011.
[20] S. Dasgupta and D. Hsu.
Hierarchical sampling for active learning. In ICML, pages 208–215, 2008.
[21] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. NIPS, 20, 2007.
[22] S. Dasgupta, A. Tauman Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.
[23] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. JMLR, 2012.
[24] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In STOC, pages 315–320, 2004.
[25] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
[26] V. Feldman. A complete characterization of statistical query learning with applications to evolvability. Journal of Computer and System Sciences, 78(5):1444–1459, 2012.
[27] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
[28] A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient pool-based active learning of halfspaces. In ICML, 2013.
[29] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
[30] M. Kearns. Efficient noise-tolerant learning from statistical queries. JACM, 45(6):983–1006, 1998.
[31] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. JMLR, 11:2457–2485, 2010.
[32] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms, 30(3):307–358, 2007.
[33] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In ICML, pages 350–358, 1998.
[34] M. Raginsky and A. Rakhlin. Lower bounds for passive and active learning.
In NIPS, pages 1026–1034, 2011.
[35] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.