{"title": "Sample and Computationally Efficient Learning Algorithms under S-Concave Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 4796, "page_last": 4805, "abstract": "We provide new results for noise-tolerant and sample-efficient learning algorithms under $s$-concave distributions. The new class of $s$-concave distributions is a broad and natural generalization of log-concavity, and includes many important additional distributions, e.g., the Pareto distribution and $t$ distribution. This class has been studied in the context of efficient sampling, integration, and optimization, but much remains unknown about the geometry of this class of distributions and their applications in the context of learning. The challenge is that unlike the commonly used distributions in learning (uniform or more generally log-concave distributions), this broader class is not closed under the marginalization operator and many such distributions are fat-tailed. In this work, we introduce new convex geometry tools to study the properties of $s$-concave distributions and use these properties to provide bounds on quantities of interest to learning including the probability of disagreement between two halfspaces, disagreement outside a band, and the disagreement coefficient. We use these results to significantly generalize prior results for margin-based active learning, disagreement-based active learning, and passive learning of intersections of halfspaces. 
Our analysis of geometric properties of $s$-concave distributions might be of independent interest to optimization more broadly.", "full_text": "Sample and Computationally Efficient Learning Algorithms under S-Concave Distributions

Maria-Florina Balcan
Machine Learning Department
Carnegie Mellon University, USA
ninamf@cs.cmu.edu

Hongyang Zhang*
Machine Learning Department
Carnegie Mellon University, USA
hongyanz@cs.cmu.edu

Abstract

We provide new results for noise-tolerant and sample-efficient learning algorithms under s-concave distributions. The new class of s-concave distributions is a broad and natural generalization of log-concavity, and includes many important additional distributions, e.g., the Pareto distribution and t-distribution. This class has been studied in the context of efficient sampling, integration, and optimization, but much remains unknown about the geometry of this class of distributions and their applications in the context of learning. The challenge is that unlike the commonly used distributions in learning (uniform or more generally log-concave distributions), this broader class is not closed under the marginalization operator and many such distributions are fat-tailed. In this work, we introduce new convex geometry tools to study the properties of s-concave distributions and use these properties to provide bounds on quantities of interest to learning including the probability of disagreement between two halfspaces, disagreement outside a band, and the disagreement coefficient. We use these results to significantly generalize prior results for margin-based active learning, disagreement-based active learning, and passive learning of intersections of halfspaces.
Our analysis of geometric properties of s-concave distributions might be of independent interest to optimization more broadly.

1 Introduction

Developing provable learning algorithms is one of the central challenges in learning theory. The study of such algorithms has led to significant advances in both the theory and practice of passive and active learning. In the passive learning model, the learning algorithm has access to a set of labeled examples sampled i.i.d. from some unknown distribution over the instance space and labeled according to some underlying target function. In the active learning model, however, the algorithm can access unlabeled examples and request labels of its own choice, and the goal is to learn the target function with significantly fewer labels. In this work, we study both learning models in the case where the underlying distribution belongs to the class of s-concave distributions.

Prior work on noise-tolerant and sample-efficient algorithms mostly relies on the assumption that the distribution over the instance space is log-concave [1, 12, 7, 30]. A distribution is log-concave if the logarithm of its density is a concave function. The assumption of log-concavity has been made for two purposes: computational efficiency and sample efficiency. For computational efficiency, it was made to obtain noise-tolerant algorithms even for seemingly simple decision surfaces like linear separators. Such simple algorithms exist for noiseless scenarios, e.g., via linear programming [28], but they are notoriously hard to obtain once we have noise [15, 25, 19]; this is why progress on noise-tolerant algorithms has focused on uniform [22, 26] and

*Corresponding author.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

log-concave distributions [4].
For other concept spaces, like intersections of halfspaces, there is no computationally efficient algorithm in the noise-free setting that works under general distributions, but there has been nice progress under uniform and log-concave distributions [27]. For sample efficiency reasons, in the context of active learning, we need distributional assumptions in order to obtain label complexity improvements [16]. The most concrete and general class for which prior work obtains such improvements is when the marginal distribution over the instance space satisfies log-concavity [32, 7]. In this work, we provide a broad generalization of all the above results, showing how they extend to s-concave distributions (s < 0). A distribution with density f(x) is s-concave if f(x)^s is a concave function. We identify key properties of these distributions that allow us to simultaneously extend all the above results.

How general and important is the class of s-concave distributions? The class of s-concave distributions is very broad and contains many well-known (classes of) distributions as special cases. For example, when s → 0, s-concave distributions reduce to log-concave distributions. Furthermore, the s-concave class contains infinitely many fat-tailed distributions that do not belong to the class of log-concave distributions, e.g., Cauchy, Pareto, and t distributions, which have been widely applied in the context of theoretical physics and economics; yet much remains unknown about how provable learning algorithms, such as active learning of halfspaces, perform under these realistic distributions. We also compare s-concave distributions with nearly-log-concave distributions, a slightly broader class than the log-concave one. A distribution with density f(x) is nearly-log-concave if for any λ ∈ [0, 1], x1, x2 ∈ R^n, we have f(λx1 + (1 − λ)x2) ≥ e^{−0.0154} f(x1)^λ f(x2)^{1−λ} [7].
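As a concrete instance of s-concavity (a minimal numerical sketch; the Pareto parameters below are illustrative choices, not taken from the paper): the Pareto density f(x) = 2x^{−3} on x ≥ 1 satisfies the defining inequality of s-concavity with s = −1/2. For negative s, raising both sides of the inequality to the power s (which is order-reversing) shows it amounts to midpoint convexity of f^s, and here f^s(x) = x^{3/2}/√2 is indeed convex.

```python
# Numerical check (illustrative, not from the paper) that the Pareto density
# f(x) = 2 x^(-3) on x >= 1 is s-concave with s = -1/2:
# the definition requires f(l*x + (1-l)*y) >= (l*f(x)^s + (1-l)*f(y)^s)^(1/s).

def f(x):
    return 2.0 * x ** -3.0

def s_concave_holds(f, s, x, y, lam):
    lhs = f(lam * x + (1 - lam) * y)
    rhs = (lam * f(x) ** s + (1 - lam) * f(y) ** s) ** (1.0 / s)
    return lhs >= rhs - 1e-12  # small tolerance for float rounding

s = -0.5
checks = [s_concave_holds(f, s, x, y, lam)
          for x in [1.0, 1.5, 2.0, 4.0, 8.0]
          for y in [1.0, 2.5, 5.0, 10.0]
          for lam in [0.1, 0.3, 0.5, 0.7, 0.9]]
print(all(checks))  # True: the defining inequality holds on this grid
```

Here equality actually holds up to rounding, since f^s is linear-in-x up to a constant power; any Pareto density with a heavier exponent gives strict inequality.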
The class of s-concave distributions includes many important extra distributions which do not belong to the nearly-log-concave class: a nearly-log-concave distribution must have sub-exponential tails (see Theorem 11, [7]), while the tail probability of an s-concave distribution might decay much more slowly (see Theorem 1 (6)). We also note that efficient sampling, integration, and optimization algorithms for s-concave distributions are well understood [13, 23]. Our analysis of s-concave distributions bridges those algorithms to the strong guarantees of noise-tolerant and sample-efficient learning algorithms.

1.1 Our Contributions

Structural Results. We study various geometric properties of s-concave distributions. These properties serve as the structural results for many provable learning algorithms, e.g., margin-based active learning [7], disagreement-based active learning [29, 21], learning intersections of halfspaces [27], etc. When s → 0, our results exactly reduce to those for log-concave distributions [7, 2, 4]. Below, we state our structural results informally:

Theorem 1 (Informal). Let D be an isotropic s-concave distribution in R^n. Then there exist closed-form functions γ(s, m), f1(s, n), f2(s, n), f3(s, n), f4(s, n), and f5(s, n) such that
1. (Weakly Closed under Marginal) The marginal of D over m arguments (or cumulative distribution function, CDF) is isotropic γ(s, m)-concave. (Theorems 3, 4)
2. (Lower Bound on Hyperplane Disagreement) For any two unit vectors u and v in R^n, f1(s, n)θ(u, v) ≤ Pr_{x∼D}[sign(u · x) ≠ sign(v · x)], where θ(u, v) is the angle between u and v. (Theorem 12)
3. (Probability of Band) There is a function d(s, n) such that for any unit vector w ∈ R^n and any 0 < t ≤ d(s, n), we have f2(s, n)t < Pr_{x∼D}[|w · x| ≤ t] ≤ f3(s, n)t. (Theorem 11)
4.
(Disagreement outside Margin) For any absolute constant c1 > 0 and any function f(s, n), there exists a function f4(s, n) > 0 such that Pr_{x∼D}[sign(u · x) ≠ sign(v · x) and |v · x| ≥ f4(s, n)θ(u, v)] ≤ c1 f(s, n)θ(u, v). (Theorem 13)
5. (Variance in 1-D Direction) There is a function d(s, n) such that for any unit vectors u and a in R^n such that ‖u − a‖ ≤ r and for any 0 < t ≤ d(s, n), we have E_{x∼D_{u,t}}[(a · x)²] ≤ f5(s, n)(r² + t²), where D_{u,t} is the conditional distribution of D over the set {x : |u · x| ≤ t}. (Theorem 14)
6. (Tail Probability) We have Pr[‖x‖ > √n t] ≤ [1 − cst/(1 + ns)]^{(1+ns)/s}. (Theorem 5)
If s → 0 (i.e., the distribution is log-concave), then γ(s, m) → 0 and the functions f(s, n), f1(s, n), f2(s, n), f3(s, n), f4(s, n), f5(s, n), and d(s, n) are all absolute constants.

Table 1: Comparisons with prior distributions for margin-based active learning, disagreement-based active learning, and Baum's algorithm.

           | Margin (Efficient, Noise)                            | Disagreement | Baum's
Prior Work | uniform [3], log-concave [4], nearly-log-concave [7] | uniform [20] | symmetric [9], log-concave [27]
Ours       | s-concave                                            | s-concave    | s-concave

To prove Theorem 1, we introduce multiple new techniques, e.g., an extension of the Prekopa-Leindler theorem and a reduction to a baseline function (see the supplementary material for our techniques), which might be of independent interest to optimization more broadly.

Margin Based Active Learning: We apply our structural results to margin-based active learning of a halfspace w* under any isotropic s-concave distribution for both the realizable and adversarial noise models. In the realizable case, the instance X is drawn from an isotropic s-concave distribution and the label Y = sign(w* · X).
In the adversarial noise model, an adversary can corrupt any η (≤ O(ε)) fraction of labels. For both cases, we show that there exists a computationally efficient algorithm that outputs a linear separator w_T such that Pr_{x∼D}[sign(w_T · x) ≠ sign(w* · x)] ≤ ε (see Theorems 15 and 16). The label complexity w.r.t. 1/ε improves exponentially over the passive learning scenario under s-concave distributions, even though the underlying distribution might be fat-tailed. To the best of our knowledge, this is the first result concerning computationally efficient, noise-tolerant margin-based active learning under the broader class of s-concave distributions. Our work solves an open problem proposed by Awasthi et al. [4] about exploring wider classes of distributions for provable active learning algorithms.

Disagreement Based Active Learning: We apply our results to agnostic disagreement-based active learning under s-concave distributions. The key to the analysis is estimating the disagreement coefficient, a distribution-dependent measure of complexity that is used to analyze certain types of active learning algorithms, e.g., the CAL [14] and A² [5] algorithms. We work out the disagreement coefficient under isotropic s-concave distributions (see Theorem 17). By composing it with existing work on active learning [17], we obtain a bound on the label complexity under the class of s-concave distributions. As far as we are aware, this is the first result concerning disagreement-based active learning that goes beyond log-concave distributions.
Our bounds on the disagreement coefficient match the best known results for the much less general case of log-concave distributions [7]; furthermore, they apply to the s-concave case, where we allow an arbitrary number of discontinuities, a case not captured by [18].

Learning Intersections of Halfspaces: Baum's algorithm is one of the most famous algorithms for learning intersections of halfspaces. The algorithm was first proposed by Baum [9] under symmetric distributions, and later extended to log-concave distributions by Klivans et al. [27], as these distributions are almost symmetric. In this paper, we show that approximate symmetry also holds for the case of s-concave distributions. With this, we work out the label complexity of Baum's algorithm under the broader class of s-concave distributions (see Theorem 18), and advance the state-of-the-art results (see Table 1).

We provide lower bounds to partially show the tightness of our analysis. Our results can potentially be applied to other provable learning algorithms as well [24, 31, 10, 30, 8], which might be of independent interest. We discuss our techniques and other related papers in the supplementary material.

2 Preliminary

Before proceeding, we define some notation and clarify our problem setup in this section.

Notation: We will use capital or lower-case letters to represent random variables, D to represent an s-concave distribution, and D_{u,t} to represent the conditional distribution of D over the set {x : |u · x| ≤ t}. We define the sign function as sign(x) = +1 if x ≥ 0 and −1 otherwise. We denote by B(α, β) = ∫_0^1 t^{α−1}(1 − t)^{β−1} dt the beta function, and by Γ(α) = ∫_0^∞ t^{α−1}e^{−t} dt the gamma function. We will consider a single norm for the vectors in R^n, namely, the 2-norm denoted by ‖x‖. We will frequently use μ (or μ_f, μ_D) to represent the measure of the probability distribution D with density function f. The notation ball(w*, t) represents the set {w ∈ R^n : ‖w − w*‖ ≤ t}. For convenience, the symbol ⊕ differs slightly from ordinary addition +: for f = 0 or g = 0, {f^s ⊕ g^s}^{1/s} = 0; otherwise, ⊕ and + are the same. For u, v ∈ R^n, we define the angle between them as θ(u, v).

2.1 From Log-Concavity to S-Concavity

We begin with the definition of s-concavity. There are slight differences among the definitions of s-concave density, s-concave distribution, and s-concave measure.

Definition 1 (S-Concave (Density) Function, Distribution, Measure). A function f: R^n → R_+ is s-concave, for −∞ ≤ s ≤ 1, if f(λx + (1 − λ)y) ≥ (λf(x)^s + (1 − λ)f(y)^s)^{1/s} for all λ ∈ [0, 1] and all x, y ∈ R^n.² A probability distribution D is s-concave if its density function is s-concave. A probability measure μ is s-concave if μ(λA + (1 − λ)B) ≥ [λμ(A)^s + (1 − λ)μ(B)^s]^{1/s} for any sets A, B ⊆ R^n and λ ∈ [0, 1].

Special classes of s-concave functions include concave (s = 1), harmonic-concave (s = −1), and quasi-concave (s = −∞) functions, among others. The conditions in Definition 1 are progressively weaker as s becomes smaller: s1-concave densities (distributions, measures) are s2-concave if s1 ≥ s2. Thus one can verify [13]: concave (s = 1) ⊊ log-concave (s = 0) ⊊ s-concave (s < 0) ⊊ quasi-concave (s = −∞).

3 Structural Results of S-Concave Distributions: A Toolkit

In this section, we develop geometric properties of s-concave distributions.
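To make Definition 1 and the range of s concrete (an illustrative sketch, not an example from the paper): the standard Cauchy density f(x) = 1/(π(1 + x²)) satisfies the defining inequality for s = −1/2, since f^{−1/2} is convex, but fails it for s closer to 0, e.g. s = −1/10.

```python
# Illustrative check of Definition 1 for the standard Cauchy density
# f(x) = 1/(pi*(1+x^2)); this example is chosen here, not taken from the paper.
import math

def f(x):
    return 1.0 / (math.pi * (1.0 + x * x))

def defn_holds(s, x, y, lam):
    lhs = f(lam * x + (1 - lam) * y)
    rhs = (lam * f(x) ** s + (1 - lam) * f(y) ** s) ** (1.0 / s)
    return lhs >= rhs - 1e-12

grid = [(x, y, lam) for x in [-4, -2, 0, 2, 3, 4]
                    for y in [-3, -1, 1, 2, 4]
                    for lam in [0.25, 0.5, 0.75]]

ok_half = all(defn_holds(-0.5, x, y, lam) for x, y, lam in grid)   # s = -1/2
ok_tenth = all(defn_holds(-0.1, x, y, lam) for x, y, lam in grid)  # s = -1/10
print(ok_half, ok_tenth)  # True False
```

The failure at s = −1/10 (e.g. at x = 2, y = 4, λ = 1/2) illustrates the strict containments above: moving s toward 0 genuinely shrinks the class.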
The challenge is that unlike the commonly used distributions in learning (uniform or more generally log-concave distributions), this broader class is not closed under the marginalization operator and many such distributions are fat-tailed. To address this issue, we introduce several new techniques. We first introduce an extension of the Prekopa-Leindler inequality so as to reduce the high-dimensional problem to the one-dimensional case. We then reduce the resulting one-dimensional s-concave function to a well-defined baseline function, and explore the geometric properties of that baseline function. We summarize our high-level proof ideas briefly in the following figure.

3.1 Marginal Distribution and Cumulative Distribution Function

We begin with the analysis of the marginal distribution, which forms the basis of the other geometric properties of s-concave distributions (s ≤ 0). Unlike the (nearly) log-concave distributions, where the marginal remains (nearly) log-concave, the class of s-concave distributions is not closed under the marginalization operator. To study the marginal, our primary tool is the theory of convex geometry. Specifically, we will use an extension of the Prékopa-Leindler inequality developed by Brascamp and Lieb [11], which allows for a characterization of the integral of s-concave functions.

Theorem 2 ([11], Thm 3.3). Let 0 < λ < 1, and let H_s, G1, and G2 be non-negative integrable functions on R^m such that H_s(λx + (1 − λ)y) ≥ [λG1(x)^s ⊕ (1 − λ)G2(y)^s]^{1/s} for every x, y ∈ R^m. Then ∫_{R^m} H_s(x)dx ≥ [λ(∫_{R^m} G1(x)dx)^γ + (1 − λ)(∫_{R^m} G2(x)dx)^γ]^{1/γ} for s ≥ −1/m, with γ = s/(1 + ms).

Building on this, the following theorem plays a key role in our analysis of the marginal distribution.

Theorem 3 (Marginal).
Let f(x, y) be an s-concave density on a convex set K ⊆ R^{n+m} with s ≥ −1/m. Denote by K|_{R^n} = {x ∈ R^n : ∃y ∈ R^m s.t. (x, y) ∈ K}. For every x in K|_{R^n}, consider the section K(x) ≜ {y ∈ R^m : (x, y) ∈ K}. Then the marginal density g(x) ≜ ∫_{K(x)} f(x, y)dy is γ-concave on K|_{R^n}, where γ = s/(1 + ms). Moreover, if f(x, y) is isotropic, then g(x) is isotropic.

Similar to the marginal, the CDF of an s-concave distribution might not remain in the same class. This is in sharp contrast to log-concave distributions. The following theorem studies the CDF of an s-concave distribution.

Theorem 4. The CDF of an s-concave distribution in R^n is γ-concave, where γ = s/(1 + ns) and s ≥ −1/n.

²When s → 0, we note that lim_{s→0}(λf(x)^s + (1 − λ)f(y)^s)^{1/s} = exp(λ log f(x) + (1 − λ) log f(y)). In this case, f(x) is known to be log-concave.

[Figure: the high-level reduction pipeline. An n-dimensional s-concave density is reduced to a one-dimensional γ-concave density via the extension of Prekopa-Leindler, and then reduced to a one-dimensional baseline function.]

Theorems 3 and 4 serve as the bridge that connects high-dimensional s-concave distributions to one-dimensional γ-concave distributions. With them, we are able to reduce the high-dimensional problem to the one-dimensional one.

3.2 Fat-Tailed Density

Tail probability is one of the most distinctive characteristics of s-concave distributions compared to (nearly) log-concave distributions. While it can be shown that a (nearly) log-concave distribution has an exponentially small tail (Theorem 11, [7]), the tail of an s-concave distribution is fat, as proved by the following theorem.

Theorem 5 (Tail Probability). Let x come from an isotropic distribution over R^n with an s-concave density.
Then for every t ≥ 16, we have Pr[‖x‖ > √n t] ≤ [1 − cst/(1 + ns)]^{(1+ns)/s}, where c is an absolute constant.

Theorem 5 is almost tight for s < 0. To see this, consider X drawn from a one-dimensional Pareto distribution with density f(x) = (−1 − 1/s)(−s/(s + 1))^{−1/s} x^{1/s} for x ≥ (s + 1)/(−s). It can be easily seen that Pr[X > t] = [(−s/(s + 1)) t]^{(s+1)/s} for t ≥ (s + 1)/(−s), which matches Theorem 5 up to an absolute constant factor.

3.3 Geometry of S-Concave Distributions

We now investigate the geometry of s-concave distributions. We first consider one-dimensional s-concave distributions: we provide bounds on the density of centroid-centered halfspaces (Lemma 6) and on the range of the density function (Lemma 7). Building upon these, we develop geometric properties of high-dimensional s-concave distributions by reducing the distributions to the one-dimensional case based on marginalization (Theorem 3).

3.3.1 One-Dimensional Case

We begin with the analysis of one-dimensional halfspaces. To bound the probability, a standard technique is to bound the centroid region and the tail region separately. However, the challenge is that s-concave distributions are fat-tailed (Theorem 5). So while the probability of a one-dimensional halfspace is bounded below by an absolute constant for log-concave distributions, such a probability for s-concave distributions decays as s (≤ 0) becomes smaller. The following lemma captures this intuition.

Lemma 6 (Density of Centroid-Centered Halfspaces). Let X be drawn from a one-dimensional distribution with an s-concave density for −1/2 ≤ s ≤ 0. Then Pr(X ≥ EX) ≥ (1 + γ)^{−1/γ} for γ = s/(1 + s).

We also study the image of a one-dimensional s-concave density.
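The Pareto example behind Theorem 5's tightness can be checked numerically (a quick sketch at the illustrative choice s = −1/2, where the density specializes to f(x) = x^{−2} on x ≥ 1):

```python
# Monte Carlo sketch of the Pareto example at s = -1/2, where the density
# specializes to f(x) = x^(-2) on x >= 1 and Pr[X > t] = 1/t.
import random

random.seed(0)
s = -0.5
N = 200_000
# Inverse-CDF sampling: CDF F(x) = 1 - 1/x on x >= 1, so X = 1/(1 - U).
samples = [1.0 / (1.0 - random.random()) for _ in range(N)]

t = 4.0
empirical = sum(x > t for x in samples) / N
closed_form = ((-s / (s + 1.0)) * t) ** ((s + 1.0) / s)  # = 1/t = 0.25
print(abs(empirical - closed_form) < 0.01)  # True
```

The polynomial decay 1/t here contrasts with the e^{−ct}-type tail forced on (nearly) log-concave distributions, which is exactly the fat-tail phenomenon the theorem quantifies.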
The following condition, s > −1/3, is required for the existence of the second-order moment.

Lemma 7. Let g : R → R_+ be an isotropic s-concave density function with s > −1/3. (a) For all x, g(x) ≤ (1 + s)/(1 + 3s); (b) We have g(0) ≥ 1/√(3(1 + γ)^{3/γ}), where γ = s/(s + 1).

3.3.2 High-Dimensional Case

We now move on to the high-dimensional case (n ≥ 2). In the following, we will assume −1/(2n + 3) ≤ s ≤ 0. Though this working range of s vanishes as n becomes larger, it is almost the broadest range of s that we can hope to achieve: Chandrasekaran et al. [13] showed a lower bound of s ≥ −1/(n − 1) if one requires the s-concave distribution to have good geometric properties. In addition, we can see from Theorem 3 that if s < −1/(n − 1), the marginal of an s-concave distribution might not even exist; such a case does happen for certain s-concave distributions with s < −1/(n − 1), e.g., the Cauchy distribution. So our range of s is almost tight up to a factor of 1/2.

We start our analysis with the density of centroid-centered halfspaces in high-dimensional spaces.

Lemma 8 (Density of Centroid-Centered Halfspaces). Let f : R^n → R_+ be an s-concave density function, and let H be any halfspace containing its centroid. Then ∫_H f(x)dx ≥ (1 + γ)^{−1/γ} for γ = s/(1 + ns).

Proof. W.l.o.g., we assume H is orthogonal to the first axis. By Theorem 3, the first marginal of f is s/(1 + (n − 1)s)-concave. Then by Lemma 6, ∫_H f(x)dx ≥ (1 + γ)^{−1/γ}, where γ = s/(1 + ns).

The following theorem is an extension of Lemma 7 to high-dimensional spaces. The proof basically reduces the n-dimensional density to its first marginal by Theorem 3, and applies Lemma 7 to bound the image.

Theorem 9 (Bounds on Density). Let f : R^n → R_+ be an isotropic s-concave density.
Then
(a) Let d(s, n) = (1 + γ)^{−1/γ}(1 + 3β)/(3 + 3β), where β = s/(1 + (n − 1)s) and γ = β/(1 + β). For any u ∈ R^n such that ‖u‖ ≤ d(s, n), we have f(u) ≥ (1 + (‖u‖/d)((2 − 2^{−(n+1)s})^{−1} − 1))^{1/s} f(0).
(b) f(x) ≤ f(0)[((1 + β)/(1 + 3β) · √(3(1 + γ)^{3/γ}) · 2^{n−1+1/s})^s − 1]^{1/s} for every x.
(c) There exists an x ∈ R^n such that f(x) > (4eπ)^{−n/2}.
(d) (4eπ)^{−n/2}[((1 + β)/(1 + 3β) · √(3(1 + γ)^{3/γ}) · 2^{n−1+1/s})^s − 1]^{−1/s} < f(0) ≤ (2 − 2^{−(n+1)s})^{1/s} nΓ(n/2)/(2π^{n/2}d^n).
(e) f(x) ≤ (2 − 2^{−(n+1)s})^{1/s} (nΓ(n/2)/(2π^{n/2}d^n)) [((1 + β)/(1 + 3β) · √(3(1 + γ)^{3/γ}) · 2^{n−1+1/s})^s − 1]^{1/s} for every x.
(f) For any line ℓ through the origin, ∫_ℓ f ≤ (2 − 2^{−ns})^{1/s} (n − 1)Γ((n − 1)/2)/(2π^{(n−1)/2}d^{n−1}).

Theorem 9 provides uniform bounds on the density function. To obtain a more refined upper bound on the image of s-concave densities, we have the following lemma. The proof is built upon Theorem 9.

Lemma 10 (More Refined Upper Bound on Densities). Let f : R^n → R_+ be an isotropic s-concave density.
Then f(x) ≤ β1(n, s)(1 − sβ2(n, s)‖x‖)^{1/s} for every x ∈ R^n, where

β1(n, s) = (1 − s)^{−1/s} (nΓ(n/2)/(2π^{n/2}d^n)) (2 − 2^{−(n+1)s})^{1/s} [((1 + β)/(1 + 3β) · √(3(1 + γ)^{3/γ}) · 2^{n−1+1/s})^s − 1]^{1/s},

β2(n, s) = (2 − 2^{−ns})^{−1/s} ((n − 1)Γ((n − 1)/2)/(2π^{(n−1)/2}d^{n−1})) [(a(n, s) + (1 − s)β1(n, s)^s)^{1+1/s} − a(n, s)^{1+1/s}] s / (β1(n, s)^s (1 + s)(1 − s)),

a(n, s) = (4eπ)^{−ns/2}[((1 + β)/(1 + 3β) · √(3(1 + γ)^{3/γ}) · 2^{n−1+1/s})^s − 1]^{−1},

with γ = β/(1 + β), β = s/(1 + (n − 1)s), and d = (1 + γ)^{−1/γ}(1 + 3β)/(3 + 3β).

We also give an absolute bound on the measure of a band.

Theorem 11 (Probability inside Band). Let D be an isotropic s-concave distribution in R^n. Denote by f3(s, n) = 2(1 + ns)/(1 + (n + 2)s). Then for any unit vector w, Pr_{x∼D}[|w · x| ≤ t] ≤ f3(s, n)t. Moreover, if t ≤ d(s, n) ≜ ((1 + 2γ)/(1 + 3γ))(1 + γ)^{−(1+γ)/(3+3γ)}, where γ = s/(1 + (n − 1)s), then Pr_{x∼D}[|w · x| ≤ t] > f2(s, n)t, where f2(s, n) = 2(2 − 2^{−2γ})^{−1/γ}(4eπ)^{−1/2}(((1 + 2γ)/(2γ))^γ − 1)^{−1/γ}.

To analyze the problem of learning linear separators, we are interested in studying the disagreement between the hypothesis of the output and the hypothesis of the target. The following theorem captures this characteristic under s-concave distributions.

Theorem 12 (Probability of Disagreement).
Assume D is an isotropic s-concave distribution in R^n. Then for any two unit vectors u and v in R^n, we have d_D(u, v) = Pr_{x∼D}[sign(u · x) ≠ sign(v · x)] ≥ f1(s, n)θ(u, v), where f1(s, n) = c(2 − 2^{−3α})^{−1/α}[((1 + β)/(1 + 3β) · √(3(1 + γ)^{3/γ}) · 2^{1+1/α})^α − 1]^{−1/α}(1 + γ)^{−2/γ}((1 + 3β)/(3 + 3β))², c is an absolute constant, α = s/(1 + (n − 2)s), β = s/(1 + (n − 1)s), and γ = s/(1 + ns).

Due to space constraints, all missing proofs are deferred to the supplementary material.

4 Applications: Provable Algorithms under S-Concave Distributions

In this section, we show that many algorithms that work under log-concave distributions behave well under s-concave distributions, by applying the above-mentioned geometric properties. For simplicity, we will frequently use the notation of Theorem 1.

4.1 Margin Based Active Learning

We first investigate margin-based active learning under isotropic s-concave distributions in both the realizable and adversarial noise models. The algorithm (see Algorithm 1) follows a localization technique: it proceeds in rounds, aiming to cut the error down by half in each round inside the margin [6].

Algorithm 1 Margin Based Active Learning under S-Concave Distributions
Input: Parameters b_k, τ_k, r_k, m_k, κ, and T as in Theorem 16.
1: Draw m_1 examples from D, label them and put them into W.
2: For k = 1, 2, ..., T
3:   Find v_k ∈ ball(w_{k−1}, r_k) to approximately minimize the hinge loss over W s.t.
‖v_k‖ ≤ 1: ℓ_{τ_k}(v_k, W) ≤ min_{w ∈ ball(w_{k−1}, r_k) ∩ ball(0, 1)} ℓ_{τ_k}(w, W) + κ/8.
4:   Normalize v_k, yielding w_k = v_k/‖v_k‖; clear the working set W.
5:   While m_{k+1} additional data points are not labeled
6:     Draw sample x from D.
7:     If |w_k · x| ≥ b_k, reject x; else ask for the label of x and put it into W.
Output: Hypothesis w_T.

4.1.1 Relevant Properties of S-Concave Distributions

The analysis requires more refined geometric properties, as below. Theorem 13 basically claims that the error mostly concentrates in a band, and Theorem 14 guarantees that the variance in any 1-D direction cannot be too large. We defer the detailed proofs to the supplementary material.

Theorem 13 (Disagreement outside Band). Let u and v be two vectors in R^n and assume that θ(u, v) = θ < π/2. Let D be an isotropic s-concave distribution. Then for any absolute constant c1 > 0 and any function f1(s, n) > 0, there exists a function f4(s, n) > 0 such that Pr_{x∼D}[sign(u · x) ≠ sign(v · x) and |v · x| ≥ f4(s, n)θ] ≤ c1 f1(s, n)θ, where f4(s, n) = 4β1(2, α)B(−1/α − 3, 3)/(−c1 f1(s, n)α³β2(2, α)³), B(·,·) is the beta function, α = s/(1 + (n − 2)s), and β1(2, α) and β2(2, α) are given by Lemma 10.

Theorem 14 (1-D Variance). Assume that D is isotropic s-concave.
For d given by Theorem 9 (a), there is an absolute constant C0 such that for all 0 < t ≤ d and for all a such that ‖u − a‖ ≤ r and ‖a‖ ≤ 1, we have E_{x∼D_{u,t}}[(a · x)²] ≤ f5(s, n)(r² + t²), where f5(s, n) = 16 + C0 · 8β1(2, η)B(−1/η − 3, 2)/(f2(s, n)β2(2, η)³(η + 1)η²), (β1(2, η), β2(2, η)) and f2(s, n) are given by Lemma 10 and Theorem 11, and η = s/(1 + (n − 2)s).

4.1.2 Realizable Case

We show that margin-based active learning works under s-concave distributions in the realizable case.

Theorem 15. In the realizable case, let D be an isotropic s-concave distribution in R^n. Then for 0 < ε < 1/4, δ > 0, and an absolute constant c, there is an algorithm (see the supplementary material) that runs in T = ⌈log(1/(cε))⌉ iterations, requires m_k = O((f3 min{2^{−k}f4 f1^{−1}, d}/2^{−k})(n log(f3 min{2^{−k}f4 f1^{−1}, d}/2^{−k}) + log((1 + k)/δ))) labels in the k-th round, and outputs a linear separator of error at most ε with probability at least 1 − δ. In particular, when s → 0 (a.k.a. log-concave), we have m_k = O(n + log((1 + k)/δ)).

By Theorem 15, we see that margin-based active learning under s-concave distributions works almost as well as under log-concave distributions in the realizable case, improving exponentially w.r.t. the variable 1/ε over passive learning algorithms.

4.1.3 Efficient Learning with Adversarial Noise

In the adversarial noise model, an adversary can choose any distribution P̃ over R^n × {+1, −1} such that the marginal D over R^n is s-concave but an η fraction of labels can be flipped adversarially. The analysis builds upon an induction technique where in each round we do hinge loss minimization in the band and cut down the 0/1 loss by half.
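The per-round step of Algorithm 1, namely querying only points in the band |w_{k−1} · x| ≤ b_k and then minimizing the hinge loss near w_{k−1}, can be sketched as follows. This is a minimal pure-Python illustration on synthetic 2-D Gaussian data; the constants, the plain subgradient solver, and the omission of the ball(w_{k−1}, r_k) constraint are simplifying choices of this sketch, not the paper's algorithm parameters.

```python
import math, random

random.seed(1)

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def norm(u): return math.sqrt(dot(u, u))

# Synthetic realizable stream: x ~ N(0, I_2), label = sign(w_star . x).
w_star = (1.0, 0.0)
def draw():
    x = (random.gauss(0, 1), random.gauss(0, 1))
    return x, (1 if dot(w_star, x) >= 0 else -1)

# One localization round: keep only points inside the band |w_prev . x| <= b.
w_prev, b, tau, m = (math.cos(0.6), math.sin(0.6)), 0.5, 0.1, 500
W = []
while len(W) < m:
    x, y = draw()
    if abs(dot(w_prev, x)) <= b:     # band query condition from Algorithm 1
        W.append((x, y))

# Plain subgradient descent on the tau-hinge loss, projected onto ball(0, 1).
w = list(w_prev)
for it in range(2000):
    g = [0.0, 0.0]
    for x, y in W:
        if y * dot(w, x) < tau:      # hinge term max(0, 1 - y(w.x)/tau) active
            g[0] -= y * x[0] / tau
            g[1] -= y * x[1] / tau
    step = 0.05 / (len(W) * math.sqrt(it + 1))
    w = [wi - step * gi for wi, gi in zip(w, g)]
    n = norm(w)
    if n > 1.0:                      # projection back onto the unit ball
        w = [wi / n for wi in w]

w = [wi / norm(w) for wi in w]
angle_before = math.acos(max(-1.0, min(1.0, dot(w_prev, w_star))))
angle_after = math.acos(max(-1.0, min(1.0, dot(w, w_star))))
print(angle_after < angle_before)  # expect the round to shrink the angle
```

The band concentrates queries where the current and target hypotheses disagree (Theorem 13), which is why one hinge minimization per round suffices to roughly halve the error.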
The algorithm was previously analyzed in [3, 4] for the special class of log-concave distributions. In this paper, we analyze it for the much more general class of s-concave distributions.

Theorem 16. Let D be an isotropic s-concave distribution in R^n over x, and let the label y obey the adversarial noise model. If the rate η of adversarial noise satisfies η < c0 ε for some absolute constant c0, then for 0 < ε < 1/4, δ > 0, and an absolute constant c, Algorithm 1 runs in T = ⌈log(1/(cε))⌉ iterations and outputs a linear separator wT such that Pr_{x∼D}[sign(wT · x) ≠ sign(w* · x)] ≤ ε with probability at least 1 − δ. The label complexity in the k-th round is mk = Õ( (n^{3/2} f3 √f5 τk / (κ^2 τk^2 √f2)) (n + log((k + k^2)/δ)) ), where κ = max{ [b_{k−1}s + τk(1 + ns)[1 − (δ/(√n(k + k^2)))^{s/(1+ns)}] + τk s]^2 / (τk^2 s^2), (1 + ns)^2 (1 − ε^{s/(1+ns)}) / (s(1 + (n + 2)s) f1(s, n)) }, τk = Θ( f1^{−2} f2^{−1/2} f3 f4^2 f5^{1/2} 2^{−(k−1)} ), and bk = min{Θ(2^{−k} f4 f1^{−1}), d}. In particular, if s → 0, mk = O( n log(n/(εδ)) (n + log(k/δ)) ).

By Theorem 16, the label complexity of margin-based active learning improves exponentially over that of passive learning w.r.t.
1/ε even under fat-tailed s-concave distributions and the challenging adversarial noise model.

4.2 Disagreement-Based Active Learning
We apply our results to the analysis of disagreement-based active learning under s-concave distributions. The key is estimating the disagreement coefficient, a measure of complexity of active learning problems that can be used to bound the label complexity [20]. Recall the definition of the disagreement coefficient w.r.t. classifier w*, precision ε, and distribution D as follows. For any r > 0, define ballD(w, r) = {u ∈ H : dD(u, w) ≤ r}, where dD(u, w) = Pr_{x∼D}[(u · x)(w · x) < 0]. Define the disagreement region as DIS(H) = {x : ∃u, v ∈ H s.t. (u · x)(v · x) < 0}, and let the Alexander capacity be capw*,D(r) = Pr_D(DIS(ballD(w*, r)))/r. The disagreement coefficient is defined as Θw*,D(ε) = sup_{r ≥ ε}[capw*,D(r)]. Below, we state our results on the disagreement coefficient under isotropic s-concave distributions.

Theorem 17 (Disagreement Coefficient). Let D be an isotropic s-concave distribution over R^n. For any w* and ε > 0, the disagreement coefficient is Θw*,D(ε) = O( √n (1 + ns)^2 (1 − ε^{s/(1+ns)}) / (s(1 + (n + 2)s) f1(s, n)) ). In particular, when s → 0 (a.k.a. log-concave), Θw*,D(ε) = O(√n log(1/ε)).

Our bounds on the disagreement coefficient match the best known results for the much less general case of log-concave distributions [7]; furthermore, they apply to the s-concave case, where we allow an arbitrary number of discontinuities, a case not captured by [18]. The result immediately implies concrete bounds on the label complexity of disagreement-based active learning algorithms, e.g., CAL [14] and A² [5]. For instance, by composing it with the result from [17], we obtain a bound of O( n(log^2(1/ε) + OPT^2/ε^2) ) for agnostic active learning under an isotropic s-concave distribution D.
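As a concrete illustration of these definitions, the following sketch estimates the disagreement coefficient by Monte Carlo for origin-centered halfspaces, assuming a rotationally invariant marginal (a standard Gaussian stand-in, not a general s-concave distribution). Under that assumption dD(u, w) = θ(u, w)/π, so ballD(w*, r) = {u : θ(u, w*) ≤ πr} and DIS(ballD(w*, r)) is the explicit band {x : |w* · x|/‖x‖ ≤ sin(πr)}:

```python
import numpy as np

def disagreement_coefficient(eps, n, n_samples=200_000, n_grid=25, rng=None):
    """Monte Carlo estimate of Theta_{w*,D}(eps) = sup_{r >= eps} Pr[DIS(ball(w*, r))] / r
    for origin-centered halfspaces under a rotationally invariant D (standard Gaussian
    here), where d_D(u, w) = theta(u, w)/pi and x lies in DIS(ball(w*, r)) iff its
    direction is within angle pi*r of the separating hyperplane."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n_samples, n))
    # W.l.o.g. take w* = e_1, by rotational invariance.
    cosines = np.abs(X[:, 0]) / np.linalg.norm(X, axis=1)
    best = 0.0
    for r in np.linspace(eps, 0.5, n_grid):   # restrict to r <= 1/2 so that pi*r <= pi/2
        p_dis = np.mean(cosines <= np.sin(np.pi * r))   # Pr[x in DIS(ball(w*, r))]
        best = max(best, p_dis / r)                     # Alexander capacity at radius r
    return best
```

In R^2 the capacity equals 2 for every r, so the estimate concentrates around 2; as n grows the estimate increases, in line with the √n dependence in Theorem 17.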
Namely, it suffices to output a halfspace with error at most OPT + ε, where OPT = min_w errD(w).

4.3 Learning Intersections of Halfspaces
Baum [9] provided a polynomial-time algorithm for learning intersections of halfspaces w.r.t. symmetric distributions. Later, Klivans et al. [27] extended the result by showing that the algorithm works under any distribution D as long as µD(E) ≈ µD(−E) for any set E. In this section, we show that it is possible to learn intersections of halfspaces under the broader class of s-concave distributions.

Theorem 18. In the PAC realizable case, there is an algorithm (see the supplementary material) that outputs a hypothesis h of error at most ε with probability at least 1 − δ under isotropic s-concave distributions. The label complexity is M(ε/2, δ/4, n^2) + max{2m2/ε, (2/ε^2) log(4/δ)}, where M(ε, δ, m) is defined by M(ε, δ, n) = M(max{δ/(4eKm1), ε/2}, δ/4, n), K = β1(3, κ) B(−1/κ − 3, 3) h(κ) / ((−κβ2(3, κ))^3 d^{3+1/κ}), h(κ) = ((1/d)((2 − 2^{−4κ})^{−1} − 1) + 1)^{1/κ} (4eπ)^{−3/2}, d = (1 + γ)^{−1/γ} (1 + 3β)/(3 + 3β), γ = κ/(1 + 2κ), β = κ/(1 + κ), κ = s/(1 + (n − 3)s), and m1, m2 are explicit quantities depending on κ, β, and γ (see the supplementary material). In particular, if s → 0 (a.k.a. log-concave), K is an absolute constant.

5 Lower Bounds
In this section, we give information-theoretic lower bounds on the label complexity of passive and active learning of homogeneous halfspaces under s-concave distributions.

Theorem 19.
For a fixed value −1/(2n + 3) ≤ s ≤ 0, we have: (a) For any s-concave distribution D in R^n whose covariance matrix is of full rank, the sample complexity of learning origin-centered linear separators under D in the passive learning scenario is Ω(n f1(s, n)/ε); (b) The label complexity of active learning of linear separators under s-concave distributions is Ω(n log(f1(s, n)/ε)).

If the covariance matrix of D is not of full rank, then the intrinsic dimension is less than n, so our lower bounds essentially apply to all s-concave distributions. According to Theorem 19, it is possible to have an exponential improvement of label complexity w.r.t. 1/ε over passive learning by active sampling, even though the underlying distribution is a fat-tailed s-concave distribution. This observation is captured by Theorems 15 and 16.

6 Conclusions

In this paper, we study the geometric properties of s-concave distributions. Our work advances the state-of-the-art results on margin-based active learning, disagreement-based active learning, and learning intersections of halfspaces w.r.t. the distributions over the instance space. When s → 0, our results reduce to the best-known results for log-concave distributions. The geometric properties of s-concave distributions can potentially be applied to other learning algorithms, which might be of independent interest more broadly.

Acknowledgements.
This work was supported in part by grants NSF CCF-1535967, NSF CCF-1422910, NSF CCF-1451177, a Sloan Fellowship, and a Microsoft Research Fellowship.

References

[1] D. Applegate and R. Kannan. Sampling and integration of near log-concave functions. In ACM Symposium on Theory of Computing, pages 156–163, 1991.

[2] P. Awasthi, M.-F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Annual Conference on Learning Theory, pages 152–192, 2016.

[3] P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. In ACM Symposium on Theory of Computing, pages 449–458, 2014.

[4] P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50, 2017.

[5] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.

[6] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Annual Conference on Learning Theory, pages 35–50, 2007.

[7] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Annual Conference on Learning Theory, pages 288–316, 2013.

[8] M.-F. Balcan and H. Zhang. Noise-tolerant life-long matrix completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 2955–2963, 2016.

[9] E. B. Baum. A polynomial time algorithm that learns two hidden unit nets. Neural Computation, 2(4):510–522, 1990.

[10] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In International Conference on Machine Learning, pages 49–56, 2009.

[11] H. J. Brascamp and E. H. Lieb.
On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. Journal of Functional Analysis, 22(4):366–389, 1976.

[12] C. Caramanis and S. Mannor. An inequality for nearly log-concave distributions with applications to learning. IEEE Transactions on Information Theory, 53(3):1043–1057, 2007.

[13] K. Chandrasekaran, A. Deshpande, and S. Vempala. Sampling s-concave functions: The limit of convexity based isoperimetry. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 420–433, 2009.

[14] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

[15] A. Daniely. Complexity theoretic limitations on learning halfspaces. In ACM Symposium on Theory of Computing, pages 105–117, 2016.

[16] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, volume 17, pages 337–344, 2004.

[17] S. Dasgupta, D. J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pages 353–360, 2007.

[18] E. Friedman. Active learning for smooth problems. In Annual Conference on Learning Theory, 2009.

[19] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.

[20] S. Hanneke. A bound on the label complexity of agnostic active learning. In International Conference on Machine Learning, pages 353–360, 2007.

[21] S. Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

[22] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio.
Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.

[23] A. T. Kalai and S. Vempala. Simulated annealing for convex optimization. Mathematics of Operations Research, 31(2):253–266, 2006.

[24] D. M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. In IEEE Symposium on Foundations of Computer Science, pages 355–366, 2017.

[25] A. Klivans and P. Kothari. Embedding hard learning problems into Gaussian space. International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, 28:793–809, 2014.

[26] A. R. Klivans, P. M. Long, and R. A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.

[27] A. R. Klivans, P. M. Long, and A. K. Tang. Baum's algorithm learns intersections of halfspaces with respect to log-concave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 588–600, 2009.

[28] R. A. Servedio. Efficient algorithms in computational learning theory. PhD thesis, Harvard University, 2001.

[29] L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. Journal of Machine Learning Research, 12(Jul):2269–2292, 2011.

[30] Y. Xu, H. Zhang, A. Singh, A. Dubrawski, and K. Miller. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2428–2437, 2017.

[31] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal active learning of halfspaces. arXiv preprint arXiv:1702.05581, 2017.

[32] C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning.
In Advances in Neural Information Processing Systems, pages 442–450, 2014.