{"title": "The Perceptron Algorithm Is Fast for Non-Malicious Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 676, "page_last": 685, "abstract": null, "full_text": "676 \n\nBaum \n\nThe Perceptron Algorithm Is Fast tor \n\nNon-Malicious Distributions \n\nErice B. Baum \n\nNEC Research Institute \n4 Independence Way \nPrinceton, NJ 08540 \n\nAbstract: Within the context of Valiant's protocol for learning, the Perceptron \nalgorithm is shown to learn an arbitrary half-space in time O(r;;) if D, the proba(cid:173)\nbility distribution of examples, is taken uniform over the unit sphere sn. Here f is \nthe accuracy parameter. This is surprisingly fast, as \"standard\" approaches involve \nsolution of a linear programming problem involving O( 7') constraints in n dimen(cid:173)\nsions. A modification of Valiant's distribution independent protocol for learning \nis proposed in which the distribution and the function to be learned may be cho(cid:173)\nsen by adversaries, however these adversaries may not communicate. It is argued \nthat this definition is more reasonable and applicable to real world learning than \nValiant's. Under this definition, the Perceptron algorithm is shown to be a distri(cid:173)\nbution independent learning algorithm. In an appendix we show that, for uniform \ndistributions, some classes of infinite V-C dimension including convex sets and a \nclass of nested differences of convex sets are learnable. \n\n\u00a71: Introduction \n\nInterest in this algorithm waned in the 1970's after it was empha(cid:173)\n\nThe Percept ron algorithm was proved in the early 1960s[Rosenblatt,1962] to \nconverge and yield a half space separating any set of linearly separable classified \nexamples. 
\nsized [Minsky and Papert, 1969] (1) that the class of problems solvable by a single half space was limited, and (2) that the Perceptron algorithm, although converging in finite time, did not converge in polynomial time. In the 1980's, however, it has become evident that there is no hope of providing a learning algorithm which can learn arbitrary functions in polynomial time, and much research has thus been restricted to algorithms which learn a function drawn from a particular class of functions. Moreover, learning theory has focused on protocols like that of [Valiant, 1984] where we seek to classify, not a fixed set of examples, but examples drawn from a probability distribution. This allows a natural notion of \"generalization\". There are very few classes which have yet been proven learnable in polynomial time, and one of these is the class of half spaces. Thus there is considerable theoretical interest now in studying the problem of learning a single half space, and so it is natural to reexamine the Perceptron algorithm within the formalism of Valiant. \n\nIn Valiant's protocol, a class of functions is called learnable if there is a learning algorithm which works in polynomial time independent of the distribution D generating the examples. Under this definition the Perceptron learning algorithm is not a polynomial time learning algorithm. However we will argue in section 2 that this definition is too restrictive. We will consider in section 3 the behavior of the Perceptron algorithm if D is taken to be the uniform distribution on the unit sphere S^n. In this case, we will see that the Perceptron algorithm converges remarkably rapidly. Indeed we will give a time bound which is faster than any bound known to us for any algorithm solving this problem.
Then, in section 4, we will present what we believe to be a more natural definition of distribution independent learning in this context, which we will call nonmalicious distribution independent learning. We will see that the Perceptron algorithm is indeed a polynomial time nonmalicious distribution independent learning algorithm. In Appendix A, we sketch proofs that, if one restricts attention to the uniform distribution, some classes with infinite Vapnik-Chervonenkis dimension such as the class of convex sets and the class of nested differences of convex sets (which we define) are learnable. These results support our assertion that distribution independence is too much to ask for, and may also be of independent interest. \n\n§2: Distribution Independent Learning \n\nIn Valiant's protocol [Valiant, 1984], a class F of Boolean functions on R^n is called learnable if a learning algorithm A exists which satisfies the following conditions. Pick some probability distribution D on R^n. A is allowed to call examples, which are pairs (x, f(x)), where x is drawn according to the distribution D. A is a valid learning algorithm for F if for any probability distribution D on R^n, for any 0 < δ, ε < 1, for any f ∈ F, A calls examples and, with probability at least 1 - δ, outputs in time bounded by a polynomial in n, 1/δ, and 1/ε a hypothesis g such that the probability that f(x) ≠ g(x) is less than ε for x drawn according to D. \n\nThis protocol includes a natural formalization of 'generalization' as prediction. For more discussion see [Valiant, 1984]. The definition is restrictive in demanding that A work for an arbitrary probability distribution D. This demand is suggested by results on uniform convergence of the empirical distribution to the actual distribution.
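As an illustrative aside (our sketch, not part of the paper), the protocol just defined can be rendered as a small simulation harness. The names pac_trial, draw_x, and learner are invented here, and the error of the hypothesis g is estimated by Monte Carlo rather than computed exactly:

```python
import random

def pac_trial(draw_x, f, learner, m, epsilon, n_test=20000):
    # One run of Valiant's protocol: the learner receives m labeled
    # examples (x, f(x)) with x drawn from D, and outputs a hypothesis g.
    sample = [(x, f(x)) for x in (draw_x() for _ in range(m))]
    g = learner(sample)
    # Estimate Pr_D[f(x) != g(x)] by Monte Carlo on fresh draws from D.
    errors = sum(f(x) != g(x) for x in (draw_x() for _ in range(n_test)))
    return errors / n_test <= epsilon

def threshold_learner(sample):
    # A toy learner for threshold functions on [0, 1]: place the cut
    # midway between the largest negative and smallest positive example.
    lo = max([x for x, y in sample if not y], default=0.0)
    hi = min([x for x, y in sample if y], default=1.0)
    t = (lo + hi) / 2.0
    return lambda x: x > t
```

For instance, with D uniform on [0, 1] and f a threshold at 1/2, a few hundred examples already drive the estimated error well below a modest ε.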
In particular, if F has Vapnik-Chervonenkis (V-C) dimension/1 d, then it has been proved [Blumer et al., 1987] that all A needs to do to be a valid learning algorithm is to call m_0(ε, δ, d) = max((4/ε)log(2/δ), (8d/ε)log(13/ε)) examples and to find in polynomial time a function g ∈ F which correctly classifies these. \n\nThus, for example, it is simple to show that the class H of half spaces is Valiant learnable [Blumer et al., 1987]. The V-C dimension of H is n + 1. All we need to do to learn H is to call m_0(ε, δ, n + 1) examples and find a separating half space using Karmarkar's algorithm [Karmarkar, 1984]. Note that the Perceptron algorithm would not work here, since one can readily find distributions for which the Perceptron algorithm would be expected to take arbitrarily long times to find a separating half space. \n\n/1 We say a set S ⊂ R^n is shattered by a class F of Boolean functions if F induces all Boolean functions on S. The V-C dimension of F is the cardinality of the largest set S which F shatters. \n\nNow, however, it seems from three points of view that the distribution independent definition is too strong. First, although the results of [Blumer et al., 1987] tell us we can gather enough information for learning in polynomial time, they say nothing about when we can actually find an algorithm A which learns in polynomial time. So far, such algorithms have only been found in a few cases, and (see, e.g. [Baum, 1989a]) these cases may be argued to be trivial. \n\nSecond, a few classes of functions have been proved (modulo strong but plausible complexity theoretic hypotheses) unlearnable by construction of cryptographically secure subclasses.
Thus for example [Kearns and Valiant, 1989] show that the class of feedforward networks of threshold gates of some constant depth, or of Boolean gates of logarithmic depth, is not learnable by construction of a cryptographically secure subclass. The relevance of such results to learning in the natural world is unclear to us. For example, these results do not rule out a learning algorithm that would learn almost any log depth net. We would thus prefer a less restrictive definition of learnability, so that if a class were proved unlearnable, it would provide a meaningful limit on pragmatic learning. \n\nThird, the results of [Blumer et al., 1987] imply that we can only expect to learn a class of functions F if F has finite V-C dimension. Thus we are in the position of assuming an enormous amount of information about the class of functions to be learned, namely that it be some specific class of finite V-C dimension, but nothing whatever about the distribution of examples. In the real world, by contrast, we are likely to know at least as much about the distribution D as we know about the class of functions F. If we relax the distribution independence criterion, then it can be shown that classes of infinite Vapnik-Chervonenkis dimension are learnable. For example, for the uniform distribution, the class of convex sets and a class of nested differences of convex sets (both of which trivially have infinite V-C dimension) are shown to be learnable in Appendix A. \n\n§3: The Perceptron Algorithm and Uniform Distributions \n\nThe Perceptron algorithm yields, in finite time, a half-space (w_H, θ_H) which correctly classifies any given set of linearly separable examples [Rosenblatt, 1962].
\nThat is, given a set of classified examples {x^μ} such that, for some (w_t, θ_t), w_t·x^μ > θ_t for the positive examples and w_t·x^μ < θ_t for the negative examples, for all μ, the algorithm converges in finite time to output a (w_H, θ_H) such that w_H·x^μ ≥ θ_H for the positive examples and w_H·x^μ < θ_H for the negative examples. We will normalize so that w_t·w_t = 1. Note that |w_t·x - θ_t| is the Euclidean distance from x to the separating hyperplane {y : w_t·y = θ_t}. \n\nThe algorithm is the following. Start with some initial candidate (w_0, θ_0), which we will take to be (0, 0). Cycle through the examples. For each example, test whether that example is correctly classified. If so, proceed to the next example. If not, modify the candidate by \n\n(w_{k+1}, θ_{k+1}) = (w_k ± x, θ_k ∓ 1),   (1) \n\nwhere the sign of the modification is determined by the classification of the misclassified example. \n\nIn this section we will apply the Perceptron algorithm to the problem of learning in the probabilistic context described in section 2, where however the distribution D generating examples is uniform on the unit sphere S^n. Rather than have a fixed set of examples, we apply the algorithm in a slightly novel way: we call an example, perform a Perceptron update step, discard the example, and iterate until we converge to accuracy ε./2 If we applied the Perceptron algorithm in the standard way, it seemingly would not converge as rapidly. We will return to this point at the end of this section. \n\nNow the number of updates the Perceptron algorithm must make to learn a given set of examples is well known to be O(1/γ^2), where γ is the minimum distance from an example to the classifying hyperplane (see e.g. [Minsky and Papert, 1969]). In order to learn to ε accuracy in the sense of Valiant, we will observe that for the uniform distribution we do not need to correctly classify examples closer to the target separating hyperplane than O(ε/√n).
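The example-discarding variant just described is easy to state in code. The following Python sketch is ours, for illustration only; the function names are invented, uniform points on S^n are drawn by normalizing Gaussian vectors, and the update (w, θ) -> (w ± x, θ ∓ 1) is applied only on misclassified examples:

```python
import random
import math

def sample_sphere(n):
    # Uniform point on the unit sphere S^n in R^{n+1} (Gaussian trick).
    v = [random.gauss(0.0, 1.0) for _ in range(n + 1)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def perceptron_online(n, target_w, target_theta, num_examples):
    # Call an example, update on it if misclassified, then discard it.
    w = [0.0] * (n + 1)
    theta = 0.0
    for _ in range(num_examples):
        x = sample_sphere(n)
        label = sum(wi * xi for wi, xi in zip(target_w, x)) > target_theta
        guess = sum(wi * xi for wi, xi in zip(w, x)) > theta
        if guess != label:
            s = 1.0 if label else -1.0   # sign set by the true class
            w = [wi + s * xi for wi, xi in zip(w, x)]
            theta -= s
    return w, theta
```

On uniform sphere data the hypothesis closes in on the target rapidly, in line with the bound derived below; no claim is made that this sketch reproduces the paper's constants.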
Thus we will prove that the Perceptron algorithm will converge (with probability 1 - δ) after O(n/ε^2) updates, which will occur after O(n/ε^3) presentations of examples. \n\nIndeed take θ_t = 0, so the target hyperplane passes through the origin. Parallel hyperplanes a distance κ/2 above and below the target hyperplane bound a band B of probability measure \n\nP(κ) = (A_{n-1}/A_n) ∫_{-κ/2}^{κ/2} (√(1 - z^2))^{n-2} dz ≤ κ A_{n-1}/A_n   (2) \n\n(for n > 2), where A_n = 2π^{(n+1)/2}/Γ((n+1)/2) is the area of S^n. See figure 1. \n\nFigure 1: The target hyperplane intersects the sphere S^n along its equator (if θ_t = 0), shown as the central line. Points in (say) the upper hemisphere are classified as positive examples and those in the lower as negative examples. The band B is formed by intersecting the sphere with two planes parallel to the target hyperplane and a distance κ/2 above and below it. \n\n/2 We say that our candidate half space has accuracy ε when the probability that it misclassifies an example drawn from D is no greater than ε. \n\nUsing the readily obtainable (e.g. by Stirling's formula) bound that A_{n-1}/A_n < √n, and the fact that the integrand is nowhere greater than 1, we find that for κ = ε/(2√n), the band has measure less than ε/2. If θ_t ≠ 0, a band of width κ will have less measure than it would for θ_t = 0. We will thus continue to argue (without loss of generality) by assuming the worst case condition that θ_t = 0. \n\nSince B has measure less than ε/2, if we have not yet converged to accuracy ε, there is no more than probability 1/2 that the next example on which we update will be in B. We will show that once we have made m_0 = max(144 ln(2/δ), 48/κ^2) updates, we have converged unless more than 7/12 of the updates are in B. The probability of making this fraction of the updates in B, however, is less than δ/2 if the probability of each update lying in B is not more than 1/2.
We conclude with confidence 1 - δ/2 that the probability our next update will be in B is greater than 1/2, and thus that we have converged to ε-accuracy. \n\nIndeed, consider the change in the quantity \n\nN = (w_k - α w_t)^2 + (θ_k - α θ_t)^2   (3) \n\nwhen we update. Now note that ±(w_k·x - θ_k) < 0, since x was misclassified by (w_k, θ_k) (else we would not update). Let A = ∓(w_t·x - θ_t). If x ∈ B, then A < 0. If x ∉ B, then A ≤ -κ/2. Recalling x^2 = 1, we see that ΔN < 2 for x ∈ B and ΔN < -ακ + 2 for x ∉ B. If we choose α = 8/κ, we find that ΔN ≤ -6 for x ∉ B. Recall that, for k = 0, with (w_0, θ_0) = (0, 0), we have N = α^2 = 64/κ^2. Thus we see that if we have made o updates on points outside B, and i updates on points in B, N < 0 if \n\n6o - 2i > 64/κ^2.   (4) \n\nBut N is positive semidefinite. Once we have made 48/κ^2 total updates, at least 7/12 of the updates must thus have been on examples in B. \n\nIf you assume that the probability of updates falling in B is less than 1/2 (and thus that our hypothesis half space is not yet at ε-accuracy), then the probability that more than 7/12 of m_0 = max(144 ln(2/δ), 48/κ^2) updates fall in B is less than δ/2. To see this define LE(p, m, r) as the probability of having at most r successes in m independent Bernoulli trials with probability of success p, and recall [Angluin and Valiant, 1979], for 0 < β < 1, that \n\nLE(p, m, (1 - β)mp) ≤ e^{-β^2 mp/2}.   (5) \n\nApplying this formula with m = m_0, p = 1/2, β = 1/6 shows the desired result. We conclude that the probability of making m_0 updates without converging to ε accuracy is less than δ/2. \n\nHowever, as it approaches ε accuracy, the algorithm will only update on a fraction ε of the examples. To get, with confidence 1 - δ/2, m_0 updates, it suffices to call M = 2m_0/ε examples.
Thus we see that the Perceptron algorithm converges, with confidence 1 - δ, after we have called \n\nM = (2/ε) max(144 ln(2/δ), 48/κ^2)   (6) \n\nexamples, where κ = ε/(2√n). \n\nEach example could be processed in time of order 1 on a \"neuron\" which computes w_k·x in time 1 and updates each of its \"synaptic weights\" in parallel. On a serial computer, however, processing each example will take time of order n, so that we have a time of order O(n^2/ε^3) for convergence on a serial computer. \n\nThis is remarkably fast. The general learning procedure, described in section 2, is to call m_0(ε, δ, n + 1) examples and find a separating halfspace, by some polynomial time algorithm for linear programming such as Karmarkar's algorithm. This linear programming problem thus contains O(n/ε) constraints in n dimensions. Even to write down the problem thus takes time O(n^2/ε). The upper time bound to solve this given by [Karmarkar, 1984] is O(n^{5.5} ε^{-2}). For large n the Perceptron algorithm is faster by a factor of n^{3.5}. Of course it is likely that Karmarkar's algorithm could be proved to work faster than O(n^{5.5}) for the particular distribution of examples of interest. If, however, Karmarkar's algorithm requires a number of iterations depending even logarithmically on n, it will scale worse (for large n) than the Perceptron algorithm./3 \n\nNotice also that if we simply called m_0(ε, δ, n + 1) examples and used the Perceptron algorithm, in the traditional way, to find a linear separator for this set of examples, our time performance would not be nearly as good. In fact, equation 2 tells us that we would expect one of these examples to be a distance O(ε/n^{1.5}) from the target hyperplane, since we are calling O(n/ε) examples and a band of width O(ε/n^{1.5}) has measure O(ε/n). Thus this approach would require O(n^3/ε^2) updates, a factor of n^2 worse than the one we have proposed.
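For concreteness, the example count of equation (6) can be evaluated numerically. This small Python helper is ours; it assumes the constants 144 ln(2/δ) and 48/κ^2 with κ = ε/(2√n), as in the argument of this section:

```python
import math

def perceptron_examples(n, eps, delta):
    # M = (2/eps) * max(144 ln(2/delta), 48/kappa^2),
    # with kappa = eps / (2 sqrt(n)), per equation (6).
    kappa = eps / (2.0 * math.sqrt(n))
    m0 = max(144.0 * math.log(2.0 / delta), 48.0 / kappa ** 2)
    return 2.0 * m0 / eps
```

For n = 100 and ε = δ = 0.1 this gives M = 3.84e7; the growth is O(n/ε^3) in examples and hence O(n^2/ε^3) in serial time.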
\nAn alternative approach to learning using only O(n/ε) examples would be to call m_0(ε/2, δ, n + 1) examples and apply the Perceptron algorithm to these until a fraction 1 - ε/2 had been correctly classified. This would suffice to assure that the hypothesis half space so generated would (with confidence 1 - δ) have error less than ε, as is seen from [Blumer et al., 1987, Theorem A3.3]. It is unclear to us what time performance this procedure would yield. \n\n§4: Non-Malicious Distribution Independent Learning \n\nNext we propose a modification of the distribution independence assumption, which we have argued is too strong to apply to real world learning. We begin with an informal description. We allow an adversary (adversary 1) to choose the function f in the class F to present to the learning algorithm A. We allow a second adversary (adversary 2) to choose the distribution D arbitrarily. We demand that (with probability 1 - δ) A converge to produce an ε-accurate hypothesis g. Thus far we have not changed Valiant's definition. Our restriction is simply that before their choice of distribution and function, adversaries 1 and 2 are not allowed to exchange information. Thus they must work independently. This seems to us an entirely natural and reasonable restriction in the real world. \n\n/3 We thank P. Vaidya for a discussion on this point. \n\nNow if we pick any distribution and any hyperplane independently, it is highly unlikely that the probability measure will be concentrated close to the hyperplane. Thus we expect to see that under our restriction, the Perceptron algorithm is a distribution independent learning algorithm for H and converges in time O(n^2/(ε^3 δ^2)) on a serial computer. \n\nIf adversary 1 and adversary 2 do not exchange information, the least we can expect is that they have no notion of a preferred direction on the sphere.
Thus our informal demand that these two adversaries do not exchange information should imply, at least, that adversary 1 is equally likely to choose any w_t (relative, e.g., to whatever direction adversary 2 takes as his z axis). This formalizes, sufficiently for our current purposes, the notion of Nonmalicious Distribution Independence. \n\nTheorem 1: Let U be the uniform probability measure on S^n and D any other probability distribution on S^n. Let R be any region on S^n of U-measure εδ and let z label some point in R. Choose a point y on S^n randomly according to U. Consider the region R' formed by translating R rigidly so that z is mapped to y. Then the probability that the measure D(R') > ε is less than δ. \n\nProof: Fix any point x ∈ S^n. Now choose y and thus R'. The probability x ∈ R' is εδ. Thus in particular, if we choose a point p according to D and then choose R', the probability that p ∈ R' is εδ. \n\nNow assume that there is probability greater than δ that D(R') > ε. Then we arrive immediately at a contradiction, since we discover that the probability that p ∈ R' is greater than εδ. Q.E.D. \n\nCorollary 2: The Perceptron algorithm is a non-malicious distribution independent learning algorithm for half spaces on the unit sphere which converges, with confidence 1 - δ, to accuracy ε in time of order O(n^2/(ε^3 δ^2)) on a serial computer. \n\nProof sketch: Let κ' = εδ/(2√n). Apply Theorem 1 to show that a band formed by hyperplanes a distance κ'/2 on either side of the target hyperplane has probability less than δ of having measure for examples greater than ε/2. Then apply the arguments of the last section, with κ' in place of κ. Q.E.D. \n\nAppendix A: Convex Sets Are Learnable for Uniform Distribution \n\nIn this appendix we sketch proofs that two classes of functions with infinite V-C dimension are learnable.
These classes are the class of convex sets and a class of nested differences of convex sets which we define. These results support our conjecture that full distribution independence is too restrictive a criterion to ask for if we want our results to have interesting applications. We believe these results are also of independent interest. \n\nTheorem 3: The class C of convex sets is learnable in time polynomial in 1/ε and 1/δ if the distribution of examples is uniform on the unit square in d dimensions. \n\nRemarks: (1) C is well known to have infinite V-C dimension. (2) So far as we know, C is not learnable in time polynomial in d as well. \n\nProof Sketch:/4 We work, for simplicity, in 2 dimensions. Our arguments can readily be extended to d dimensions. \n\nThe learning algorithm is to call M examples (where M will be specified). The positive examples are by definition within the convex set to be learned. Let M+ be the set of positive examples. We classify examples as negative if they are linearly separable from M+, i.e. outside of c+, the convex hull of M+. \n\nClearly this approach will never misclassify a negative example, but may misclassify positive examples which are outside c+ and inside c_t. To show ε-accuracy, we must choose M large enough so that, with confidence 1 - δ, the symmetric difference of the target set c_t and c+ has area less than ε. \n\nFigure 2: The boundary of the target concept c_t is shown. The set I_1 of little squares intersecting the boundary of c_t are hatched vertically. The set I_2 of squares just inside I_1 are hatched horizontally. The set I_3 of squares just inside I_2 are hatched diagonally. If we have an example in each square in I_2, the convex hull of these examples contains all points inside c_t except possibly those in I_1, I_2, or I_3. \n\n/4 This proof is inspired by arguments presented in [Pollard, 1984], pp 22-24. After this proof was completed, the author heard D. Haussler present related, unpublished results at the 1989 Snowbird meeting on Neural Computation. \n\nDivide the unit square into k^2 equal subsquares. (See figure 2.) Call the set of subsquares which the boundary of c_t intersects I_1. It is easy to see that the cardinality of I_1 is no greater than 4k. The set I_2 of subsquares just inside I_1 also has cardinality no greater than 4k, and likewise for the set I_3 of subsquares just inside I_2. If we have an example in each of the squares in I_2, then c_t and c+ clearly have symmetric difference at most equal the area of I_1 ∪ I_2 ∪ I_3 < 12k × k^{-2} = 12/k. Thus take k = 12/ε. Now choose M sufficiently large so that after M trials there is less than δ probability we have not got an example in each of the 4k squares in I_2. Thus we need LE(k^{-2}, M, 4k) < δ. Using equation 5, we see that M = 5k^3 ln(8/δ) will suffice. Q.E.D. \n\nActually, one can learn (for uniform distributions) a more complex class of functions formed out of nested convex regions. For any set {c_1, c_2, ..., c_l} of l convex regions in R^d, let R_1 = c_1 and for j = 2, ..., l let R_j = R_{j-1} ∩ c_j. Then define a concept f = R_1 - R_2 + R_3 - ... R_l. The class C of concepts so formed we call nested convex sets. See figure 3. \n\nFigure 3: c_1 is the five sided region, c_2 is the triangular region, and c_3 is the square. The positive region c_1 - c_2 ∩ c_1 + c_3 ∩ c_2 ∩ c_1 is shaded. \n\nThis class can be learned by an iterative procedure which peels the onion.
Call a sufficient number of examples. (One can easily see that a number polynomial in l, ε, and δ, but of course exponential in d, will suffice.) Let the set of examples so obtained be called S. Those negative examples which are linearly separable from all positive examples are in the outermost layer. Class these in set S_1. Those positive examples which are linearly separable from all negative examples in S - S_1 lie in the next layer; call this set of positive examples S_2. Those negative examples in S - S_1 linearly separable from all positive examples in S - S_2 lie in the next layer, S_3. In this way one builds up l + 1 sets of examples. (Some of these sets may be empty.) One can then apply the methods of Theorem 3 to build a classifying function from the outside in. If the innermost layer S_{l+1} is (say) negative examples, then any future example is called negative if it is not linearly separable from S_{l+1}, or is linearly separable from S_l and not linearly separable from S_{l-1}, or is linearly separable from S_{l-2} but not linearly separable from S_{l-3}, etc. \n\nAcknowledgement: I would like to thank L.E. Baum for conversations and L.G. Valiant for comments on a draft. Portions of the work reported here were performed while the author was an employee of Princeton University and of the Jet Propulsion Laboratory, California Institute of Technology, and were supported by NSF grant DMR-8518163 and agencies of the US Department of Defense including the Innovative Science and Technology Office of the Strategic Defense Initiative Organization. \n\nReferences \n\nANGLUIN, D., and VALIANT, L.G. (1979), Fast probabilistic algorithms for Hamiltonian circuits and matchings, J. of Computer and Systems Sciences, 18, pp 155-193. \nBAUM, E.B. (1989), On learning a union of half spaces, Journal of Complexity, V5, N4. \nBLUMER, A., EHRENFEUCHT, A., HAUSSLER, D., and WARMUTH, M.
(1987), Learnability and the Vapnik-Chervonenkis Dimension, U.C.S.C. tech. rep. UCSC-CRL-87-20, and J. ACM, to appear. \nKARMARKAR, N. (1984), A new polynomial time algorithm for linear programming, Combinatorica, 4, pp 373-395. \nKEARNS, M., and VALIANT, L. (1989), Cryptographic limitations on learning Boolean formulae and finite automata, Proc. 21st ACM Symp. on Theory of Computing, pp 433-444. \nMINSKY, M., and PAPERT, S. (1969), Perceptrons: An Introduction to Computational Geometry, MIT Press, Cambridge, MA. \nPOLLARD, D. (1984), Convergence of Stochastic Processes, New York: Springer-Verlag. \nROSENBLATT, F. (1962), Principles of Neurodynamics, Spartan Books, N.Y. \nVALIANT, L.G. (1984), A theory of the learnable, Comm. of ACM, V27, N11, pp 1134-1142.", "award": [], "sourceid": 226, "authors": [{"given_name": "Eric", "family_name": "Baum", "institution": null}]}