{"title": "Efficient Learning of Linear Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 189, "page_last": 195, "abstract": null, "full_text": "Efficient Learning of Linear Perceptrons \n\nShai Ben-David \nDepartment of Computer Science \nTechnion \nHaifa 32000, Israel \nshai@cs.technion.ac.il \n\nHans Ulrich Simon \nFakultät für Mathematik \nRuhr-Universität Bochum \nD-44780 Bochum, Germany \nsimon@lmi.ruhr-uni-bochum.de \n\nAbstract \n\nWe consider the existence of efficient algorithms for learning the class of half-spaces in R^n in the agnostic learning model (i.e., making no prior assumptions on the example-generating distribution). The resulting combinatorial problem - finding the best-agreement half-space over an input sample - is NP-hard to approximate to within some constant factor. We suggest a way to circumvent this theoretical bound by introducing a new measure of success for such algorithms. An algorithm is μ-margin successful if the agreement ratio of the half-space it outputs is as good as that of any half-space once training points that are inside the μ-margins of its separating hyper-plane are disregarded. We prove crisp computational complexity results with respect to this success measure: On one hand, for every positive μ, there exist efficient (poly-time) μ-margin successful learning algorithms. On the other hand, we prove that unless P=NP, there is no algorithm that runs in time polynomial in the sample size and in 1/μ that is μ-margin successful for all μ > 0. \n\n1 Introduction \n\nWe consider the computational complexity of learning linear perceptrons for arbitrary (i.e., non-separable) data sets. While there are quite a few perceptron learning algorithms that are computationally efficient on separable input samples, it is clear that 'real-life' data sets are usually not linearly separable. 
The task of finding a linear perceptron (i.e., a half-space) that maximizes the number of correctly classified points for an arbitrary labeled input sample is known to be NP-hard. Furthermore, even the task of finding a half-space whose success rate on the sample is within some constant ratio of an optimal one is NP-hard [1]. \n\nA possible way around this problem is offered by the support vector machines (SVM) paradigm. In a nutshell, the SVM idea is to replace the search for a linear separator in the feature space of the input sample by first embedding the sample into a Euclidean space of much higher dimension, so that the images of the sample points do become separable, and then applying learning algorithms to the image of the original sample. The SVM paradigm enjoys impressive practical success; however, it can be shown ([3]) that there are cases in which such embeddings are bound to require high dimension and allow only small margins, which in turn entails the collapse of the known generalization performance guarantees for such learning. \n\nWe take a different approach. While sticking with the basic empirical risk minimization principle, we propose to replace the worst-case performance analysis by an alternative measure of success. Whereas the common definition of the approximation ratio of an algorithm requires its profit to remain within some fixed ratio of that of an optimal solution for all inputs, we allow the relative quality of our algorithm to vary between different inputs. For a given input sample, the number of points that the algorithm's output half-space should classify correctly relates not only to the success rate of the best possible half-space, but also to the robustness of this rate to perturbations of the hyper-plane. 
This new success requirement is intended to provide a formal measure that, while being achievable by efficient algorithms, retains a guaranteed quality of the output 'whenever possible'. \n\nThe new success measure depends on a margin parameter μ. An algorithm is called μ-margin successful if, for any input labeled sample, it outputs a hypothesis half-space that classifies correctly as many sample points as any half-space can classify correctly with margin μ (that is, discounting points that are too close to the separating hyper-plane). \n\nConsequently, a μ-margin successful algorithm is required to output a hypothesis with close-to-optimal performance on the input data (optimal in terms of the number of correctly classified sample points) whenever this input sample has an optimal separating hyper-plane that achieves larger-than-μ margins for most of the points it classifies correctly. On the other hand, if for every hyper-plane h that achieves a close-to-maximal number of correctly classified input points a large percentage of the correctly classified points are close to h's boundary, then an algorithm can settle for a relatively poor success ratio without violating the μ-margin success criterion. \n\nWe obtain a crisp analysis of the computational complexity of perceptron learning under the μ-margin success requirement: \n\nOn one hand, for every μ > 0 we present an efficient μ-margin successful learning algorithm (that is, an algorithm that runs in time polynomial in both the input dimension and the sample size). On the other hand, unless P=NP, no algorithm whose running time is polynomial in the sample size and dimension and in 1/μ can be μ-margin successful for all μ > 0. \n\nNote that, by the hardness of approximating linear perceptrons result of [1] cited above, for μ = 0, μ-margin learning is NP-hard (even NP-hard to approximate). 
\n\nWe conclude that the new success criterion for learning algorithms provides a rigorous success guarantee that captures the constraints imposed on perceptron learning by computational efficiency requirements. \n\nIt is well known by now that margins play an important role in the analysis of generalization performance (or sample complexity). The results of this work demonstrate that a similar notion of margins is a significant component in the determination of the computational complexity of learning as well. \n\nDue to lack of space, in this extended abstract we skip all the technical proofs. \n\n2 Definitions and Notation \n\nWe shall be interested in the problem of finding a half-space that maximizes the agreement with a given labeled input data set. More formally, \n\nBest Separating Hyper-plane (BSH) Inputs are of the form (n, S), where n >= 1 and S = {(x_1, η_1), ..., (x_m, η_m)} is a finite labeled sample; that is, each x_i is a point in R^n and each η_i is a member of {+1, -1}. A hyper-plane h(w, t), where w ∈ R^n and t ∈ R, correctly classifies (x, η) if sign(<w, x> - t) = η, where <w, x> denotes the dot product of the vectors w and x. We define the profit of h = h(w, t) on S as \n\nprofit(h|S) = |{(x_i, η_i) : h correctly classifies (x_i, η_i)}| / |S| \n\nThe goal of a Best Separating Hyper-plane algorithm is to find a pair (w, t) so that profit(h(w, t)|S) is as large as possible. \n\nIn the sequel, we refer to an input instance with parameter n as an n-dimensional input. \n\nOn top of the Best Separating Hyper-plane problem we shall also refer to the following combinatorial optimization problems: \n\nBest Separating Homogeneous Hyper-plane (BSHH) The same problem as BSH, except that the separating hyper-plane must be homogeneous, that is, t must be set to zero. 
The restriction of BSHH to input points from S^{n-1}, the unit sphere in R^n, is called the Best Separating Hemisphere Problem (BSHem) in the sequel. \n\nDensest Hemisphere (DHem) Inputs are of the form (n, P), where n >= 1 and P is a list of (not necessarily different) points from S^{n-1}, the unit sphere in R^n. The problem is to find the Densest Hemisphere for P, that is, a weight vector w ∈ R^n such that H^+(w, 0) contains as many points from P as possible (accounting for their multiplicity in P). \n\nDensest Open Ball (DOB) Inputs are of the form (n, P), where n >= 1 and P is a list of points from R^n. The problem is to find the Densest Open Ball of radius 1 for P, that is, a center z ∈ R^n such that B(z, 1) contains as many points from P as possible (accounting for their multiplicity in P). \n\nFor the sake of our proofs, we shall also have to address the following well-studied optimization problem: \n\nMAX-E2-SAT Inputs are of the form (n, C), where n >= 1 and C is a collection of 2-clauses over n Boolean variables. The problem is to find an assignment a ∈ {0, 1}^n satisfying as many 2-clauses of C as possible. \n\nMore generally, a maximization problem defines for each input instance I a set of legal solutions, and for each (instance, legal-solution) pair (I, σ), it defines profit(I, σ) ∈ R^+, the profit of σ on I. For each maximization problem Π and each input instance I for Π, opt_Π(I) denotes the maximum profit that can be realized by a legal solution for I. The subscript Π is omitted when this does not cause confusion. The profit realized by an algorithm A on input instance I is denoted by A(I). The quantity \n\n(opt(I) - A(I)) / opt(I) \n\nis called the relative error of algorithm A on input instance I. A is called a δ-approximation algorithm for Π, where δ ∈ R^+, if its relative error on I is at most δ for all input instances I. 
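To make the profit measure concrete, here is a minimal sketch (not part of the paper's algorithms; the helper name `profit` and the sample are illustrative assumptions) that evaluates profit(h(w, t)|S) for candidate hyper-planes on a small non-separable sample:

```python
# Sketch of the profit measure of the BSH problem (illustrative helper names).

def profit(w, t, sample):
    """Fraction of labeled points (x, eta) with sign(<w, x> - t) == eta."""
    def sign(v):
        return 1 if v > 0 else -1
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    correct = sum(1 for x, eta in sample if sign(dot(w, x) - t) == eta)
    return correct / len(sample)

# A small non-separable sample in R^2: no half-space classifies all four correctly.
sample = [([1.0, 0.0], +1), ([2.0, 0.0], -1), ([-1.0, 0.0], -1), ([-2.0, 0.0], +1)]

print(profit([1.0, 0.0], 0.0, sample))    # 0.5
print(profit([-1.0, 0.0], -1.5, sample))  # 0.75: the best achievable on this sample
```

A Best Separating Hyper-plane algorithm must output a pair (w, t) maximizing this quantity; the hardness results cited above say that even approximating that maximum to within a constant ratio is NP-hard.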
\n\n2.1 The new notion of approximate optimization: μ-margin approximation \n\nAs mentioned in the introduction, we shall discuss a variant of the above common notion of approximation for the Best Separating Hyper-plane problem (as well as for the other geometric maximization problems listed above). The idea behind this new notion, which we term 'μ-margin approximation', is that the required approximation rate varies with the structure of the input sample. When there exist optimal solutions that are 'stable', in the sense that minor variations to these solutions will not affect their profit, then we require a high approximation ratio. On the other hand, when all optimal solutions are 'unstable' we settle for lower approximation ratios. \n\nThe following definitions focus on separation problems, but extend to densest-set problems in the obvious way. \n\nDefinition 2.1 Given a hypothesis class H = ∪_n H_n, where each H_n is a collection of subsets of R^n, and a parameter μ >= 0, \n\n• A margin function is a function M : ∪_n (H_n × R^n) → R^+. That is, given a hypothesis h ⊆ R^n and a point x ∈ R^n, M(h, x) is a non-negative real number, the margin of x w.r.t. h. In this work, in most cases M(h, x) is the Euclidean distance between x and the boundary of h, normalized by ||x||_2 and, for linear separators, by the 2-norm of the hyper-plane h as well. \n\n• Given a finite labeled sample S and a hypothesis h ∈ H_n, the profit realized by h on S with margin μ is \n\nprofit(h|S, μ) = |{(x_i, η_i) : h correctly classifies (x_i, η_i) and M(h, x_i) >= μ}| / |S| \n\n• For a labeled sample S, let opt_μ(S) = max_{h ∈ H}(profit(h|S, μ)). \n\n• h ∈ H_n is a μ-margin approximation for S w.r.t. H if profit(h|S) >= opt_μ(S). \n\n• An algorithm A is μ-successful for H if for every finite n-dimensional input S it outputs A(S) ∈ H_n which is a μ-margin approximation for S w.r.t. H. 
\n\n• Given any of the geometric maximization problems listed above, Π, its μ-relaxation is the problem of finding, for each input instance of Π, a μ-margin approximation. For a given parameter μ > 0, we denote the μ-relaxation of a problem Π by Π[μ]. \n\n3 Efficient μ-margin successful learning algorithms \n\nOur hyper-plane learning algorithm is based on the following result of Ben-David, Eiron and Simon [2]: \n\nTheorem 3.1 For every (constant) μ > 0, there exists a μ-margin successful polynomial time algorithm A_μ for the Densest Open Ball Problem. \n\nWe shall now show that the existence of a μ-successful algorithm for Densest Open Balls implies the existence of μ-successful algorithms for Densest Hemispheres and Best Separating Homogeneous Hyper-planes. Towards this end we need notions of reductions between combinatorial optimization problems. The first definition, of a cost preserving polynomial reduction, is standard, whereas the second definition is tailored for our notion of μ-margin success. Once this, somewhat technical, preliminary stage is over, we shall describe our learning algorithms and prove their performance guarantees. \n\nDefinition 3.2 Let Π and Π' be two maximization problems. A cost preserving polynomial reduction from Π to Π', written as Π <=_pol^cp Π', consists of the following components: \n\n• a polynomial time computable mapping which maps input instances of Π to input instances of Π', so that whenever I is mapped to I', opt(I') >= opt(I); \n\n• for each I, a polynomial time computable mapping which maps each legal solution σ' for I' to a legal solution σ for I having the same profit as σ'. \n\nThe following result is evident: \n\nLemma 3.3 If Π <=_pol^cp Π' and there exists a polynomial time δ-approximation algorithm for Π', then there exists a polynomial time δ-approximation algorithm for Π. \n\nClaim 3.4 BSH <=_pol^cp BSHH <=_pol^cp BSHem <=_pol^cp DHem. \n\nThe notion of a cost preserving reduction extends to μ-relaxations as follows. \n\nDefinition 3.5 Let Π and Π' be two maximization problems and let μ, μ' > 0. 
A cost preserving polynomial reduction from Π[μ] to Π'[μ'], written as Π[μ] <=_pol^cp Π'[μ'], consists of the following components: \n\n• a polynomial time computable mapping which maps input instances of Π to input instances of Π', so that whenever I is mapped to I', opt_{μ'}(I') >= opt_μ(I); \n\n• for each I, a polynomial time computable mapping which maps each legal solution σ' for I' to a legal solution σ for I having the same profit as σ'. \n\nThe following result is evident: \n\nLemma 3.6 If Π[μ] <=_pol^cp Π'[μ'] and there exists a polynomial time μ'-margin successful algorithm for Π', then there exists a polynomial time μ-margin successful algorithm for Π. \n\nTo conclude our reduction of the Best Separating Hyper-plane problem to the Densest Open Ball problem we need yet another step. \n\nLemma 3.8 For μ > 0, let μ' = 1 - √(1 - μ^2) and μ'' = μ^2/2. Then \n\nDHem[μ] <=_pol^cp DOB[μ'] <=_pol^cp DOB[μ''] \n\nThe proof is a bit technical and is deferred to the full version of this paper. \n\nApplying Theorem 3.1 and the above reductions, we therefore get: \n\nTheorem 3.9 For each (constant) μ > 0, there exists a μ-successful polynomial time algorithm A_μ for the Best Separating Hyper-plane problem. \n\nClearly, the same result holds for the problems BSHH, DHem and BSHem as well. \n\nLet us conclude by describing the learning algorithms for the BSH (or BSHH) problem that result from this analysis. \n\nWe construct a family (A_k)_{k ∈ N} of polynomial time algorithms. Given a labeled input sample S, the algorithm A_k exhaustively searches through all subsets of S of size <= k. For each such subset, it computes a hyper-plane that separates the positive from the negative points of the subset with maximum margin (if a separating hyper-plane exists). 
The algorithm then computes the number of points in S that each of these hyper-planes classifies correctly, and outputs the one that maximizes this number. \n\nIn [2] we prove that our Densest Open Ball algorithm is μ-successful for μ = 1/√(k-1) (when applied to all k-size subsamples). Applying Lemma 3.8, we may conclude for problem BSH that, for every k, A_k is (4/(k-1))^{1/4}-successful. In other words: in order to be μ-successful, we must apply algorithm A_k for k = 1 + ⌈4/μ^4⌉. \n\n4 NP-Hardness Results \n\nWe conclude this extended abstract by proving some NP-hardness results that complement rather tightly the positive results of the previous section. We shall base our hardness reductions on two known results. \n\nTheorem 4.1 [Håstad, [4]] Assuming P ≠ NP, for any α < 1/22, there is no polynomial time α-approximation algorithm for MAX-E2-SAT. \n\nTheorem 4.2 [Ben-David, Eiron and Long, [1]] Assuming P ≠ NP, for any α < 3/418, there is no polynomial time α-approximation algorithm for BSH. \n\nApplying Claim 3.4 we readily get: \n\nCorollary 4.3 Assuming P ≠ NP, for any α < 3/418, there is no polynomial time α-approximation algorithm for BSHH, BSHem, or DHem. \n\nSo far we discussed μ-relaxations only for a value of μ that was fixed regardless of the input dimension. All the above discussion extends naturally to the case of a dimension-dependent margin parameter. Let μ̄ denote a sequence (μ_1, ..., μ_n, ...). For a problem Π, its μ̄-relaxation refers to the problem obtained by considering the margin value μ_n for inputs of dimension n. A main tool for proving hardness is the notion of μ̄-legal input instances. An n-dimensional input sample S is called μ̄-legal if the maximal profit on S can be achieved by a hypothesis h* that satisfies profit(h*|S) = profit(h*|S, μ_n). Note that the μ̄-relaxation of a problem is NP-hard if the problem restricted to μ̄-legal input instances is NP-hard. 
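As a concrete illustration of the margin-discounted profit and of legal input instances, the following sketch (illustrative helper names, not from the paper) takes the margin of Definition 2.1 to be |<w, x> - t| / (||w||_2 ||x||_2); a sample is legal for a margin value when some profit-maximizing hyper-plane loses nothing under the margin discount:

```python
import math

def margin_profit(w, t, sample, mu=0.0):
    """Fraction of points classified correctly with normalized margin >= mu,
    the margin taken as |<w, x> - t| / (||w||_2 * ||x||_2)."""
    def norm(v):
        return math.sqrt(sum(vi * vi for vi in v))
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    hits = 0
    for x, eta in sample:
        score = dot(w, x) - t
        margin = abs(score) / (norm(w) * norm(x))
        if (1 if score > 0 else -1) == eta and margin >= mu:
            hits += 1
    return hits / len(sample)

# Unit-sphere points separated homogeneously by w* = (0, 1), t = 0.
sample = [([0.0, 1.0], +1), ([0.6, 0.8], +1), ([0.0, -1.0], -1), ([-0.6, -0.8], -1)]
w, t = [0.0, 1.0], 0.0

print(margin_profit(w, t, sample))           # 1.0: optimal profit, no margin requirement
print(margin_profit(w, t, sample, mu=0.9))   # 0.5: the two points at margin 0.8 are discounted
# Since profit(h*|S) equals the margin-discounted profit for every mu <= 0.8,
# this sample is legal for any margin sequence with mu_n <= 0.8.
```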
\n\nUsing a special type of reduction that, due to space constraints, we cannot elaborate here, we can show that Theorem 4.1 implies the following: \n\nTheorem 4.4 \n\n1. Assuming P ≠ NP, there is no polynomial time 1/198-approximation for BSH even when only 1/√(36n)-legal input instances are allowed. \n\n2. Assuming P ≠ NP, there is no polynomial time 1/198-approximation for BSHH even when only 1/√(45(n+1))-legal input instances are allowed. \n\nUsing the standard cost preserving reduction chain from BSHH via BSHem to DHem, and noting that these reductions are obviously margin-preserving, we get the following: \n\nCorollary 4.5 Let S be one of the problems BSHH, BSHem, or DHem, and let μ̄ be given by μ_n = 1/√(45(n+1)). Unless P=NP, there exists no polynomial time 1/198-approximation for S[μ̄]. In particular, the μ̄-relaxations of these problems are NP-hard. \n\nSince the 1/√(45(n+1))-relaxation of the Densest Hemisphere Problem is NP-hard, applying Lemma 3.8 we immediately get: \n\nCorollary 4.6 The 1/(90(n+1))-relaxation of the Densest Ball Problem is NP-hard. \n\nFinally, note that Theorem 4.4 and Corollaries 4.5 and 4.6 rule out the existence of \"strong schemes\" (A_μ) whose running time is also polynomial in 1/μ. \n\nReferences \n\n[1] Shai Ben-David, Nadav Eiron, and Philip Long. On the difficulty of approximately maximizing agreements. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT 2000), pages 266-274. \n\n[2] Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. The computational complexity of densest region detection. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT 2000), pages 255-265. \n\n[3] Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. Non-embedability in Euclidean half-spaces. Technion TR, 2000. \n\n[4] Johan Håstad. Some optimal inapproximability results. 
In Proceedings of the 29th Annual Symposium on Theory of Computing, pages 1-10, 1997.", "award": [], "sourceid": 1808, "authors": [{"given_name": "Shai", "family_name": "Ben-David", "institution": null}, {"given_name": "Hans-Ulrich", "family_name": "Simon", "institution": null}]}