{"title": "Complexity of Finite Precision Neural Network Classifier", "book": "Advances in Neural Information Processing Systems", "page_first": 668, "page_last": 675, "abstract": null, "full_text": "668 \n\nDembo, Siu and Kailath \n\nComplexity of Finite Precision \n\nNeural Network Classifier \n\nAmir Dembo1 \nInform. Systems Lab. \nStanford University \nStanford, Calif. 94305 \n\nKai-Yeung Siu \n\nInform. Systems Lab. \nStanford University \nStanford, Calif. 94305 \n\nThomas Kailath \nInform. Systems Lab. \nStanford University \nStanford, Calif. 94305 \n\nABSTRACT \n\nA rigorous analysis on the finite precision computational <)Spects of \nneural network as a pattern classifier via a probabilistic approach \nis presented. Even though there exist negative results on the capa(cid:173)\nbility of perceptron, we show the following positive results: Given \nn pattern vectors each represented by en bits where e > 1, that are \nuniformly distributed, with high probability the perceptron can \nperform all possible binary classifications of the patterns. More(cid:173)\nover, the resulting neural network requires a vanishingly small pro(cid:173)\nportion O(log n/n) of the memory that would be required for com(cid:173)\nplete storage of the patterns. Further, the perceptron algorithm \ntakes O(n2) arithmetic operations with high probability, whereas \nother methods such as linear programming takes O(n3 .5 ) in the \nworst case. We also indicate some mathematical connections with \nVLSI circuit testing and the theory of random matrices. \n\n1 \n\nIntroduction \n\nIt is well known that the percept ron algorithm can be used to find the appropriate \nparameters in a linear threshold device for pattern classification, provided the pat(cid:173)\ntern vectors are linearly separable. 
Since the number of parameters in a perceptron is significantly smaller than the number needed to store the whole data set, it is tempting to \n\n1 The coauthor is now with the Mathematics and Statistics Department of Stanford University. \n\nconclude that when the patterns are linearly separable, the perceptron can achieve a reduction in storage complexity. However, Minsky and Papert [1] have shown an example in which both the learning time and the parameters grow exponentially, so that the perceptron would need much more storage than the whole list of patterns. \nWays around such examples can be explored by noting that an analysis that assumes real arithmetic and disregards finite precision aspects might yield misleading results. For example, we present below a simple network with one real-valued weight that can simulate all possible classifications of n real-valued patterns into k classes, when unlimited accuracy and a continuous distribution of the patterns are assumed. For simplicity, let us assume the patterns are real numbers in [0,1]. Consider the following sequence {x_{i,j}} generated by each pattern x_i for i = 1, ..., n: \n\nx_{i,1} = k * x_i mod k \n\nx_{i,j} = k * x_{i,j-1} mod k for j > 1 \n\nu(x_i, j) = [x_{i,j}] \n\nwhere [ ] denotes the integer part. \nLet f: {x_1, ..., x_n} -> {0, ..., k-1} denote the desired classification of the patterns. It is easy to see that for any continuous distribution on [0,1], there exists a j such that u(x_i, j) = f(x_i) for all i, with probability one. So the network y = u(x, w) may simulate any classification, with w = j determined from the desired classification as shown above. \n\nSo in this paper, we emphasize the finite precision computational aspects of pattern classification problems and provide partial answers to the following questions: \n\n\u2022 Can the perceptron be used as an efficient form of memory? 
\n\n\u2022 Does the 'learning' time of the perceptron become too long to be practical most of the time, even when the patterns are assumed to be linearly separable? \n\n\u2022 How do the convergence results compare to those obtained by solving a system of linear inequalities? \n\nWe attempt to answer the above questions by using a probabilistic approach. The theorems will be presented without proofs; details of the proofs will appear in a complete paper. In the following analysis, the phrase 'with high probability' means that the probability of the underlying event goes to 1 as the number of patterns goes to infinity. First, we shall introduce the classical model of a perceptron in more detail and give some known results on its limitations as a pattern classifier. \n\n2 The Perceptron \nA perceptron is a linear threshold device which computes a linear combination of the coordinates of the pattern vector, compares the value with a threshold, and outputs +1 or -1 according to whether the value is larger or smaller than the threshold. More formally, we have \nOutput: \n\nsign{ <w, x> - theta } = sign{ sum_{i=1}^{d} x_i * w_i - theta } \n\nInput: \n\npattern vector x = (x_1, ..., x_d) in R^d \n\nParameters: \n\nweights w = (w_1, ..., w_d) in R^d \n\nthreshold theta in R \n\nsign{y} = +1 if y >= 0, -1 otherwise \n\nGiven m patterns x_1, ..., x_m in R^d, there are 2^m possible ways of classifying each of the patterns to +/-1. When a desired classification of the patterns is achievable by a perceptron, the patterns are said to be linearly separable. Rosenblatt (1962) [2] showed that if the patterns are linearly separable, then there is a 'learning' algorithm, which he called the perceptron learning algorithm, to find the appropriate parameters w and theta. Let sigma_i = +/-1 be the desired classification of the pattern x_i. Also, let y_i = sigma_i * x_i. The perceptron learning algorithm runs as follows: \n\n1. Set k = 1, choose an initial value w(k) != 0. \n2. Select an i in {1, ..., n}, set y(k) = y_i. \n3. If w(k) . y(k) > 0, go to 2. Else \n4. Set w(k+1) = w(k) + y(k), k = k + 1, go to 2. \n\nThe algorithm terminates when the condition in step 3 is true for all y_i. If the patterns are linearly separable, then the above perceptron algorithm is guaranteed to converge in finitely many iterations, i.e. step 4 would be reached only finitely often. \n\nThe existence of such a simple and elegant 'learning' algorithm brought a great deal of interest during the 1960s. However, the capability of the perceptron is very limited, since only a small portion of the 2^m possible binary classifications can be achieved. In fact, Cover (1965) [3] has shown that a perceptron can classify the patterns in at most \n\n2 * sum_{i=0}^{d-1} C(m-1, i) = O(m^{d-1}) \n\ndifferent ways out of the 2^m possibilities. \nThe above upper bound O(m^{d-1}) is achieved when the pattern vectors are in general position, i.e. every subset of d vectors in {x_1, ..., x_m} is linearly independent. An immediate generalization of this result is the following: \n\nTheorem 1 For any function f(w, x) which lies in a function space of dimension r, i.e. if we can write \n\nf(w, x) = a_1(w) f_1(x) + ... + a_r(w) f_r(x) \n\nthen the number of possible classifications of m patterns by sign{f(w, x)} is bounded by O(m^{r-1}). \n\n3 A New Look at the Perceptron \nThe reason why the perceptron is so limited in its capability as a pattern classifier is that the dimension of the pattern vector space is kept fixed while the number of patterns is increased. We consider the binary expansion of each coordinate and view the real pattern vector as a binary vector, but in a much higher dimensional space. The intuition behind this is that we are now making use of every bit of information in the pattern. 
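The perceptron learning algorithm of Section 2 can be sketched in a few lines of code. The following is an illustrative Python sketch (ours, not from the paper); the threshold theta is assumed folded into an extra constant coordinate of each pattern, and the function name and the rule for selecting a violated pattern in step 2 are our choices:

```python
import numpy as np

def perceptron_learn(patterns, labels, max_iter=100_000):
    """Perceptron learning: with y_i = sigma_i * x_i, repeatedly pick a
    pattern violating w . y_i > 0 and update w <- w + y_i.  Returns a
    separating weight vector, or None if max_iter is exhausted."""
    Y = labels[:, None] * patterns             # rows are y_i = sigma_i * x_i
    w = Y[0].astype(float)                     # step 1: any nonzero w(1)
    for _ in range(max_iter):
        violated = np.flatnonzero(Y @ w <= 0)  # step 3: is w(k) . y(k) <= 0 ?
        if violated.size == 0:
            return w                           # step 3 holds for all y_i
        w = w + Y[violated[0]]                 # step 4: w(k+1) = w(k) + y(k)
    return None
```

For linearly separable patterns the loop terminates after finitely many updates.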
Let us assume that each pattern vector has dimension d and that each coordinate is given with m bits of accuracy, which grows with the number of patterns n in such a way that d * m = c * n for some c > 1. By considering the binary expansion, we can treat the patterns as binary vectors, i.e. each vector belongs to {+1, -1}^{cn}. If we want to classify the patterns into k classes, we can use log k binary classifiers, each classifying the patterns into the corresponding bit of the binary encoding of the k classes. So without loss of generality, we assume that the number of classes equals 2. Now the classification problem can be viewed as the implementation of a partial Boolean function whose value is specified on only n inputs out of the 2^{cn} possible ones. For arbitrary input patterns, there does not seem to exist a more efficient way than complete storage of the patterns and the use of a look-up table for classification, which will require O(n^2) bits. It is natural to ask if this is the best we can do. Surprisingly, using probabilistic methods in combinatorics [4] (counting arguments), we can show the following: \n\nTheorem 2 For n sufficiently large, there exists a system that can simulate all possible binary classifications with parameter storage of n + 2 log n bits. \n\nMoreover, a recent result from the theory of VLSI testing [5] implies that at least n + log n bits are needed. As the proof of Theorem 2 is non-constructive, both the learning of the parameters and the retrieval of the desired classification in the 'optimal' system may be too complex for any practical purpose. Besides, since there is almost no redundancy in the storage of parameters in such an 'optimal' system, there will be no 'generalization' properties, i.e. it is difficult to predict what the output of the system would be on patterns that were not trained. 
However, a perceptron classifier, while sub-optimal in the sense of Theorem 3 below, requires only O(n log n) bits for parameter storage, compared with O(n^2) bits for a table look-up classifier. In addition, it will exhibit 'generalization' properties, in the sense that new patterns that are close in Hamming distance to the trained patterns are likely to be classified into the same class. So, if we allow some vanishingly small probability of error, we can give an affirmative answer to the first question raised at the beginning: \n\nTheorem 3 Assume the n pattern vectors are uniformly distributed over {+1, -1}^{cn}. Then with high probability, the patterns can be classified in all 2^n possible ways using the perceptron algorithm. Further, the storage of parameters requires only O(n log n) bits. \n\nIn other words, when the input patterns are given with high precision, the perceptron can be used as an efficient form of memory. \n\nThe known upper bound on the learning time of the perceptron depends on the maximum length of the input pattern vectors, and the minimum distance delta of the pattern vectors to a separating hyperplane. In the following analysis, our probabilistic assumption guarantees the pattern vectors to be linearly independent with high probability, and thus linearly separable. In order to give a probabilistic upper bound on the learning time of the perceptron, we first give a lower bound on the minimum distance delta that holds with high probability: \n\nLemma 1 Let n be the number of pattern vectors, each in R^m, where m = (1 + eps)n and eps > 0 is any constant. Assume the entries of each vector v are iid random variables with zero mean and bounded second moment. Then with probability -> 1 as n -> infinity, there exists a separating hyperplane and a delta* > 0 such that each vector is at a distance of at least delta* from it. 
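Lemma 1 can be probed numerically. The sketch below is our own code (the least-norm construction is just one convenient way to exhibit a separating hyperplane, not the lemma's proof): it draws n random +/-1 patterns in dimension m = (1 + eps)n, picks an arbitrary labeling, and measures the margin achieved.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 40, 0.5
m = int((1 + eps) * n)                    # dimension m = (1 + eps) n
X = rng.choice([-1.0, 1.0], size=(n, m))  # n random +/-1 patterns in R^m
sigma = rng.choice([-1.0, 1.0], size=n)   # an arbitrary desired labeling
Y = sigma[:, None] * X                    # rows y_i = sigma_i * x_i
# With m > n the rows of Y are linearly independent with high probability,
# so Y w = 1 has a solution; its least-norm solution w places every pattern
# at distance at least 1/||w|| from the hyperplane {x : w . x = 0}.
w = np.linalg.lstsq(Y, np.ones(n), rcond=None)[0]
delta_star = (Y @ w).min() / np.linalg.norm(w)
print(delta_star)                         # a strictly positive margin
```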
\n\nIn our case, each coordinate of the patterns is assumed to be equally likely +/-1, and clearly the conditions in the above lemma are satisfied. In general, when the dimension of the pattern vectors is larger than, and increases linearly with, the number of patterns, the above lemma applies provided the patterns are given with high enough precision that a continuous distribution is a sufficiently good model for analysis. \n\nThe above lemma makes use of a famous conjecture from the theory of random matrices [6] which gives a lower bound on the minimum singular value of a random matrix. We actually proved the conjecture during our course of study; it states that the minimum singular value of a cn by n random matrix with c > 1 grows as sqrt(n) almost surely. \n\nTheorem 4 Let A_n be a cn x n random matrix with c > 1, whose entries are i.i.d. with zero mean and bounded second moment, and let sigma_min(.) denote the minimum singular value of a matrix. Then there exists beta > 0 such that \n\nlim inf_{n -> infinity} sigma_min(A_n) / sqrt(n) > beta \n\nwith probability 1. \n\nNote that our probabilistic assumption on the patterns includes a wide class of distributions, in particular the zero mean normal and the symmetric uniform distribution on a bounded interval. In addition, they satisfy the following condition: \n\n(*) There exists an a > 0 such that P{ |v| > a * sqrt(n) } -> 0 as n -> infinity. \n\nBefore we answer the last two questions raised at the beginning, we state the following known result on the perceptron algorithm as a second lemma: \n\nLemma 2 Suppose there exists a unit vector w* such that w* . v > delta for some delta > 0 and for all pattern vectors v. Then the perceptron algorithm will converge to a solution vector in at most N^2 / delta^2 iterations, where N is the maximum length of the pattern vectors. 
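The bound in Lemma 2 is easy to check numerically. In our sketch below (with the threshold dropped and w started from zero for simplicity), the update count is compared against N^2 / delta^2 computed from the margin of the final weight vector; that margin is at most the best margin delta of the lemma, so the resulting inequality is implied by the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 40
X = rng.choice([-1.0, 1.0], size=(n, m))   # n random +/-1 patterns in R^m
sigma = rng.choice([-1.0, 1.0], size=n)
Y = sigma[:, None] * X                     # rows y_i = sigma_i * x_i

w = np.zeros(m)                            # simplified: start from w = 0
updates = 0
while updates < 1_000_000 and (Y @ w <= 0).any():
    i = np.flatnonzero(Y @ w <= 0)[0]      # some still-violated pattern
    w = w + Y[i]                           # perceptron update
    updates += 1

N = np.linalg.norm(X, axis=1).max()        # max pattern length (= sqrt(m))
delta = (Y @ w).min() / np.linalg.norm(w)  # margin of the final unit w
print(updates, N**2 / delta**2)            # updates is at most the bound
```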
\n\nNow we are ready to state the following \n\nTheorem 5 Suppose the patterns satisfy the probabilistic assumptions stated in Lemma 1 and the condition (*). Then with high probability, the perceptron takes O(n^2) arithmetic operations to terminate. \n\nAs mentioned earlier, another way of finding a separating hyperplane is to solve a system of linear inequalities using linear programming, which requires O(n^3.5) arithmetic operations [7]. Under our probabilistic assumptions, the patterns are linearly independent with high probability, so that we can actually solve a system of linear equations instead. However, this still requires O(n^3) arithmetic operations. Further, these methods require batch processing, in the sense that all patterns have to be stored in advance in order to find the desired parameters, in contrast to the sequential 'learning' nature of the perceptron algorithm. So for training this neural network classifier, the perceptron algorithm seems preferable. \n\nWhen the number of patterns is polynomial in the total number of bits representing each pattern, we may first extend each vector to a dimension at least as large as the number of patterns, and then apply the perceptron to compress the storage of parameters. One way of adding these extra bits is to form products of the coordinates within each pattern. Note that by doing so, the coordinates of each pattern are pairwise independent. We conjecture that Theorem 3 still applies, implying an even greater reduction in storage requirements. Simulation results strongly support our conjecture. \n\n4 Conclusion \nIn this paper, the finite precision computational aspects of pattern classification problems are emphasized. We show that the perceptron, in contrast to common belief, can be quite efficient as a pattern classifier, provided the patterns are given with high enough precision. 
Using a probabilistic approach, we show that the perceptron algorithm can even outperform linear programming under certain conditions. During the course of this work, we also discovered some mathematical connections with VLSI circuit testing and the theory of random matrices. In particular, we have proved an open conjecture regarding the minimum singular value of a random matrix. \n\nAcknowledgements \n\nThis work was supported in part by the Joint Services Program at Stanford University (US Army, US Navy, US Air Force) under Contract DAAL03-88-C-0011, and NASA Headquarters, Center for Aeronautics and Space Information Sciences (CASIS) under Grant NAGW-419-S5. \n\nReferences \n[1] M. Minsky and S. Papert, Perceptrons, The MIT Press, expanded edition, 1988. \n\n[2] F. Rosenblatt, Principles of Neurodynamics, Spartan Books, New York, 1962. \n\n[3] T. M. Cover, \"Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition\", IEEE Trans. on Electronic Computers, EC-14:326-334, 1965. \n\n[4] P. Erdos and J. Spencer, Probabilistic Methods in Combinatorics, Academic Press/Akademiai Kiado, New York-Budapest, 1974. \n\n[5] G. Seroussi and N. Bshouty, \"Vector Sets for Exhaustive Testing of Logic Circuits\", IEEE Trans. Inform. Theory, IT-34:513-522, 1988. \n\n[6] J. Cohen, H. Kesten and C. Newman, editors, Random Matrices and Their Applications, volume 50 of Contemporary Mathematics, American Mathematical Society, 1986. \n\n[7] N. Karmarkar, \"A New Polynomial-Time Algorithm for Linear Programming\", Combinatorica, 4:373-395, 1984. 
\n\n\f", "award": [], "sourceid": 277, "authors": [{"given_name": "Amir", "family_name": "Dembo", "institution": null}, {"given_name": "Kai-Yeung", "family_name": "Siu", "institution": null}, {"given_name": "Thomas", "family_name": "Kailath", "institution": null}]}