{"title": "Designing Linear Threshold Based Neural Network Pattern Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 811, "page_last": 817, "abstract": null, "full_text": "Designing Linear Threshold Based Neural \n\nNetwork Pattern Classifiers \n\nTerrence L. Fine \nSchool of Electrical Engineering \nCornell University \nIthaca, NY 14853 \n\nAbstract \n\nThe three problems that concern us are identifying a natural domain of \npattern classification applications of feed forward neural networks, select(cid:173)\ning an appropriate feedforward network architecture, and assessing the \ntradeoff between network complexity, training set size, and statistical reli(cid:173)\nability as measured by the probability of incorrect classification. We close \nwith some suggestions, for improving the bounds that come from Vapnik(cid:173)\nChervonenkis theory, that can narrow, but not close, the chasm between \ntheory and practice. \n\n1 Speculations on Neural Network Pattern Classifiers \n\nThe goal is to provide rapid, reliable classification of new inputs from a \n(1) \npattern source. Neural networks are appropriate as pattern classifiers when the \npattern sources are ones of which we have little understanding, beyond perhaps a \nnonparametric statistical model, but we have been provided with classified samples \nof features drawn from each of the pattern categories. Neural networks should be \nable to provide rapid and reliable computation of complex decision functions. The \nissue in doubt is their statistical response to new inputs. \n\n(2) The pursuit of optimality is misguided in the context of Point (1). Indeed, it \nis unclear what might be meant by 'optimality' in the absence of a more detailed \nmathematical framework for the pattern source. \n\n(3) The well-known, oft-cited 'curse of dimensionality' exposed by Richard Bell(cid:173)\nman may be a 'blessing' to neural networks. 
Individual network processing nodes (e.g., linear threshold units) become more powerful as the number of their inputs increases. For a large enough number n of points in an input space of d dimensions, the number of dichotomies that can be generated by such a node grows exponentially in d. This suggests that, unlike all previous efforts at pattern classification, which required substantial effort directed at the selection of low-dimensional feature vectors so as to make the decision rule calculable, we may now be approaching a position from which we can exploit raw data (e.g., the actual samples in a time series or pixel values in an image). Even if we are as yet unable to achieve this, it is clear from the reports on actual pattern classifiers presented at NIPS90 and the accompanying Keystone Workshop that successful neural network pattern classifiers have been constructed that accept as inputs feature vectors having hundreds of components (e.g., Guyon et al. [1990]). \n\n(4) The blessing of dimensionality is not granted if there is either a large subset of critically important components that will force the network to be too complex or a small subset that contains almost all of the information needed for accurate discrimination. The network is likely to be successful in those cases where the input or feature vector x has components that are individually nearly irrelevant, although collectively they enable us to discriminate well. Examples of such feature vectors might be the responses of individual fibers in the optic nerve, a pixel array for an image of an alphanumeric character, or the set of time samples of an acoustic transient. No one fiber, pixel value, or time sample provides significant information as to the true pattern category, although all of them taken together may enable us to do nearly error-free classification. 
An example in which all components are critically important is the calculation of parity. On our account, this is the sort of problem for which neural networks are inappropriate, although it has been repeatedly established that they can calculate parity. \n\nWe interpret 'critically important' very weakly as meaning that the subspace spanned by the subset of critically important features/inputs needs to be partitioned by the classifier so that there is at least one bounded region. If the nodes are linear threshold units, then carving out a bounded region, minimally a simplex, in a subspace of dimension c, where c is the size of the subset of critically important inputs, will require a network having at least c + 1 nodes in the first layer. \n\n(5) Neural networks have opened up a new application domain wherein in practice we can intelligently construct nonlinear pattern classifiers characterized by thousands of parameters. In practice, nonlinear statistical models, ones not defined in terms of a covariance matrix, seem to be restricted to a few parameters. \n\n(6) Nonetheless, Occam's Razor advises us to be sparing of parameters. We should be particularly cautious about the problem of overfitting when the number of parameters in the network is not much less than the number of training samples. Theory needs to provide practice with better insight and guidelines for avoiding overfitting and for the use of restrictions on training time as a guard against overfitting a system with almost as many adjustable parameters as there are data points. \n\n(7) Points (1) and (5) combine to suggest that analytical approaches to network performance evaluation based upon typical statistical ideas may either be difficult to carry out or yield conclusions of little value to practice. 
There is no mismatch between statistical theory and neural networks in principle, but there does seem to be a significant mismatch in practice. While we are usually dealing with thousands of training samples, the complexity of the network means that we are not in a regime where asymptotic analyses (large sample behavior) will prove informative. On the other hand, the problem is far too complex to be resolved by 'exact' small sample analyses. These considerations serve to validate the widespread use of simulation studies to assess network design and performance. \n\n2 The QED Architecture \n\n2.1 QED Overview \n\nOne may view a classifier as either making the decision as to the correct class or as providing 'posterior' probabilities for the various classes. If we adopt the latter approach, then the use of sigmoidal units having a continuum of responses is appropriate. If, however, we adopt the first approach, then we require hard-limiting devices to select one of only finitely many (in our case only two) pattern classes. This is the approach that we adopt, and it leads us to reliance upon linear threshold units (LTUs). \n\nWe have focused our attention upon a flexible architecture consisting of a first hidden layer that is viewed as a quantizer of the input feature vector x and is therefore referred to as the Q-layer. The binary outputs from the Q-layer are then input to a second hidden layer whose function is to expand the dimension of the set of Q-layer outputs. This E-layer enables us to exploit the blessing of dimensionality in that by choosing it wide enough we can ensure that all Boolean functions of the binary outputs of the Q-layer are implementable as linearly separable functions of the E-layer outputs. 
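The dimension-expansion claim can be checked numerically. The following is a minimal sketch, with our own illustrative choices (3-bit parity as the target function, and a one-hot encoding of the 2^m Q-output patterns standing in for the E-layer): parity is not linearly separable over the m Q-layer bits themselves, yet a single threshold node over the expanded outputs computes it exactly.

```python
import itertools

m = 3
patterns = list(itertools.product([0, 1], repeat=m))  # the 2^m Q-output patterns

def parity(bits):
    # target Boolean function of the Q-layer outputs (a hard case for one LTU)
    return sum(bits) % 2

def e_layer(q):
    # one E-node per cell index i: fires iff the Q-output equals pattern i,
    # so the E-layer output is a one-hot vector of length 2^m
    return [1 if tuple(q) == p else 0 for p in patterns]

# single output node: weight parity(i) from E-node i, threshold at 1/2
weights = [parity(p) for p in patterns]

def d_node(e):
    return 1 if sum(w * a for w, a in zip(weights, e)) >= 0.5 else 0

# one threshold node over the expanded outputs reproduces parity everywhere
assert all(d_node(e_layer(q)) == parity(q) for q in patterns)
```

The same construction works for any target function f on the Q-layer outputs: the weights become f(i), which is why no Boolean function is out of reach once the expansion layer is wide enough.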
Hence, to implement a binary classifier we need a third layer consisting of only a single node to effect the desired decision, and this output layer is referred to as the D-layer. The layers taken together are called a QED architecture. \n\n2.2 Constructing the Q-Layer \n\nThe first layer in a feedforward neural network having LTUs can always be viewed as a quantizer. Subsequent layers in the network only see the input x through the window provided by the first layer quantization. We do not expect to be able to quantize/partition the input space, say R^d for large d, into many small compact regions; to do so would require that m >> d, as noted in Point (4) of the preceding section. Hence, asymptotic results drawn from deterministic approximation theory are unlikely to be helpful here. One might have recourse to the large literature on vector quantization (e.g., the special issue on quantization of the IEEE Transactions on Information Theory, March 1982), but we expect to quantize vectors of high dimension into a relatively small number of regions. Most of the information-theoretic literature on vector quantization does not address this domain of very low information rate (bits/coordinate). A more promising direction is that of clustering algorithms (e.g., k-means as in Pollard [1982], Darken and Moody [1990]) to guide the choice of the Q-layer. \n\n2.3 Constructing the E,D-Layers \n\nSpace limitations prevent us from a detailed discussion of the formation of the E and D layers. In brief, the E-layer can be composed of 2^m, often fewer, nodes, where the weights to the ith node from the m Q-layer nodes are a binary representation of the index i with '0' replaced by '-1'. No training is required for the E-layer. The desired D-layer responses of 0 or 1 are formed simply by assigning weight t to connections from E-layer nodes corresponding to input patterns from class t, and summing and thresholding at 1/2. 
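As a concrete end-to-end reading of the Q, E, D recipe, here is a small sketch in which every specific (the two hyperplanes, the toy data, the function names) is our illustration rather than the paper's: two fixed LTUs quantize R^2 into four cells, each occupied cell receives the class held by the majority of training points landing in it, and the decision is made by thresholding that cell's class weight at 1/2.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Q-layer: m = 2 LTUs whose hyperplanes are x1 = 0 and x2 = 0
W = np.array([[1.0, 0.0], [0.0, 1.0]])
theta = np.zeros(2)

def q_layer(x):
    # binary quantization of the feature vector: one bit per hyperplane
    return tuple(int(b) for b in (W @ x >= theta))

# Toy training set T: class 1 in the open first quadrant, class 0 elsewhere
X = rng.normal(size=(200, 2))
t = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

# majority rule over T assigns a category to each occupied cell
votes = {}
for x_i, t_i in zip(X, t):
    votes.setdefault(q_layer(x_i), Counter())[int(t_i)] += 1
cell_class = {cell: c.most_common(1)[0][0] for cell, c in votes.items()}

def qed_classify(x):
    # E-layer: one-hot indicator of the occupied cell; D-layer: weight
    # cell_class[cell] on that connection, then threshold the sum at 1/2
    return 1 if cell_class.get(q_layer(x), 0) >= 0.5 else 0

print(qed_classify(np.array([0.5, 0.5])))  # a first-quadrant point
```

Only the Q-layer (fixed by hand here) would require training in general; the E and D layers are wired directly from the cell categories.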
The training set T must be consulted to determine, say, on the basis of majority rule, the category t ∈ {0, 1} to assign to a given E-layer node. \n\n2.4 The Width of the Q-Layer \n\nThe overall complexity of the QED net depends upon the number m of nodes in the Q-layer. Hence, our proposal will only be of practical interest if m need not be large. As a first argument concerning the size of this parameter, if m ≤ d then m hyperplanes in general position partition R^d into 2^m regions/cells. These cells are only of interest to us if we know how to assign them to pattern classes. From the perspective of Point (1) in the preceding section, we can only determine a classification of a cell if we have classified data points lying in the cell. Thus, if we wish to make rational use of m nodes in the Q-layer, then we should have in excess of 2^m data points in our training set. If we have fewer data points in T, then we will be generating a multitude of cells about whose categorization we know no more than that provided by possibly known prior class probabilities. Another estimate of the required sample size is obtained by assuming that data points are placed at random in the cells. In this case, results summarized and improved upon in Flatto [1982] suggest that we will need in excess of m·2^m points to have a reasonable probability of having all cells occupied by data points. Many of the experimental studies reported at the meeting and workshops of NIPS90 assumed training set sizes no larger than about 10,000, implying that we need not consider m in excess of about 10. This number of nodes still yields a tractable QED architecture. \n\nA second argument on which to base an a priori determination of m can be made by considering the problem-average performance analyses carried out by Hughes [1968]. 
He found that the probability of correct classification for a randomly selected classification problem, with equal prior probabilities for selecting a class, varied with the number M of possible feature values as (3M − 2)/(4M − 2), a quantity that rapidly approaches its limiting value of 3/4. This conclusion would suggest that a Q-layer containing as few as five properly selected nodes would suffice (Point (2)) for the design of a good pattern classifier. \n\nIn any event, both of our arguments suggest that a QED net having no more than about 10 Q-layer nodes might be adequate for many applications. At worst we would have to contemplate about 1,000 nodes in the E-layer, and this is not a prohibitively large number given current directions in hardware development. Nonetheless, the contradiction between our suggestions and current practice suggests that our conclusions are only tentative, and they need to be explored through applications, simulations, and studies of statistical generalization ability. \n\n3 Sketch of Vapnik-Chervonenkis Theory of Statistical Generalization \n\nWe assume that there are two pattern classes labelled by t ∈ {0, 1}. A pattern sample is reduced by a preprocessor to a feature vector x ∈ R^d. Point (3) expresses the goal of having this reduction be significantly less than would be required by an approach that does not use neural networks. Neural networks are generically labelled by η : R^d → {0, 1}, with η(x) the class assigned to x. N = {η} denotes the family of networks described by an architecture. As above, m denotes the width of the first hidden layer, and M denotes the number of cells/regions into which a net in N can partition R^d. Typically, M = 2^m. The training set T = {(x_i, t_i), i = 1, ..., n}. We hypothesize that the elements of T are i.i.d. as P(x, t), which is unknown. 
\nPerformance is measured by error probabilities, \n\nε(η) = P(η(x) ≠ t). \n\nA good (it need not be unique) net in the family N is \n\nη⁰ = argmin_{η∈N} ε(η),  so that  ε(η⁰) = min_{η∈N} ε(η). \n\nε_B denotes the Bayes error probability calculated on the basis of P(x, t). The empirical error frequency ν_T(η) sustained by net η applied to T is \n\nν_T(η) = (1/n) Σ_{i=1}^{n} |η(x_i) − t_i|. \n\nA net in N having good classification performance on the training set T is \n\nη* = argmin_{η∈N} ν_T(η). \n\nBy definition, ν_T(η*) ≤ ν_T(η⁰) and ε(η⁰) ≤ ε(η*). \n\nLet m_N(n) denote the VC growth function: the maximum, taken over all sets of n points in the input space, of the number of subsets that can be generated by the classification functions in N. Let V_N denote the VC capacity, the largest n for which N can generate all 2^n of the subsets of some such set of n points. \n\nFor n > V_N, Vapnik-Chervonenkis theory (Vapnik [1982], Pollard [1984], Baum and Haussler [1989]) can be adapted to yield the VC upper bound \n\nP(ε(η*) − ε(η⁰) ≥ ε) ≤ 6 ((2n)^(V_N) / V_N!) e^(−nε²/16) = 6 e^(V_N log 2n − log V_N! − nε²/16). \n\nLet n_c denote the critical value of the sample size n for which the exponent first becomes negative. If n < n_c then the upper bound will exceed unity and be uninformative. However, if n > n_c then the upper bound will converge to zero exponentially fast in the sample size. An approximate solution for n_c from the VC upper bound yields \n\nn_c ≈ (16/ε²) V_N (log(32e/ε²) + log log(32e/ε²)). \n\nIf for purposes of illustration we take ε = .1, V_N = 50, then we find that n_c ≈ 902,000. This conclusion, obtained by a direct application of Vapnik-Chervonenkis theory, disagrees by orders of magnitude with the experience of practitioners gained in training such low-complexity networks (about 50 connections). 
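The critical sample size can be checked mechanically against the bound itself. A short sketch of ours (logarithms are natural, matching the exponent above) evaluates the closed-form approximation for n_c and, for comparison, searches directly for the point where the exponent V_N log 2n − log V_N! − nε²/16 turns negative:

```python
import math

def vc_exponent(n, V, eps):
    # exponent of the VC upper bound: V log(2n) - log(V!) - n eps^2 / 16
    return V * math.log(2 * n) - math.lgamma(V + 1) - n * eps**2 / 16

def critical_n(V, eps):
    # smallest n (beyond the regime n > V where the exponent is positive)
    # at which the exponent turns negative: doubling search, then bisection
    lo = hi = max(V, 2)
    while vc_exponent(hi, V, eps) >= 0:
        lo, hi = hi, 2 * hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if vc_exponent(mid, V, eps) >= 0:
            lo = mid
        else:
            hi = mid
    return hi

eps, V = 0.1, 50
L = math.log(32 * math.e / eps**2)
approx = (16 / eps**2) * V * (L + math.log(L))  # closed-form approximation
print(round(approx), critical_n(V, eps))        # both on the order of 900,000
```

The two numbers differ only through the looseness of the closed-form approximation; either way, samples on the order of 900,000 are demanded for this 50-connection network.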
\n\n4 Tightening the VC Argument \n\nThere are several components of the derivation of VC bounds that involve approximations, and these, therefore, can be sources for improving the bounds. These approximations include recourses to Chernoff/Hoeffding bounds, union bounds, estimates of m_N(n), and the relation between ε(η*) − ε(η⁰) and 2 sup_{η∈N} |ν_T(η) − ε(η)|. There is a belief among members of the neural network community that the weakness of the VC argument lies in the fact that, by dealing with all possible underlying distributions P, it is dealing with the worst case, and that this worst case forces the large sample sizes. We agree with all but the last part of this belief. VC arguments, being independent of the choice of P, do indeed have to deal with worst cases. However, the worst case is dealt with through recourse to Chernoff/Hoeffding inequalities, and it is easily shown that these inequalities are not the source of our difficulties. A more promising direction in which to seek realistic estimates of training set size is through reductions in m_N(n) achieved through constraints on the architecture N. One such restriction is through training time bounds that in effect restrict the portion of N that can be explored. Two other restrictions are discussed below. \n\n5 Restricting the Architecture \n\n5.1 Parameter Quantization \n\nWe can control the growth function contribution by quantizing all network parameters to k bits and thereby restricting N. The VC dimension of an LTU with parameters quantized to k ≥ 1 bits equals the VC dimension of the LTU with real-valued parameters. Hence, VC arguments show no improvement. However, there are now only 2^(km(d+1)) distinct first layers of m nodes accepting vectors from R^d. Hence, there are no more than 2^(2^m + km(d+1)) QED nets, and the restricted N has only finitely many members. 
\n\nDirect application of the union bound and Chernoff inequality yields \n\nP(ε(η*) − ε(η⁰) ≥ ε) ≤ 2^(2 + 2^m + km(d+1)) e^(−nε²/2). \n\nWhen ε = .1, m = 5, d = 10, this bound becomes less than unity for n > n_c = 4710 + 7625k. Thus, even 1-bit quantization suggests a training sample size in excess of 4700 for reliable generalization of even this simple network. \n\n5.2 Clustering \n\nThe growth function m_N(n) 'overestimates' the number of cases we need to be concerned with in dealing with the random variable Z(η) = |ν_T(η) − ν_T′(η)| encountered in VC theory derivations, where T′ is a second sample of size n. We are only interested in whether Z exceeds a prescribed precision level ε, and not whether, say, Z(η_1) differs from Z(η_2) by as little as 1/n due to η_2 disagreeing with η_1 at only a single sample point. \n\nTo enforce consideration of networks as being different only if they yield classifications of T that disagree substantially with each other, we might proceed by clustering the points in T into κ clusters for each of the two classes. We then train the network so that decision boundaries do not subdivide individual clusters (see also Devroye and Wagner [1979]). The union bound and Chernoff inequality yield \n\nP(ε(η*) − ε(η⁰) ≥ ε) ≤ 2^(2 + 4κ) e^(−nε²/2), \n\na result that is independent of the input dimension d. \n\nIf we again choose ε = .1, then the sample size n required to make this upper bound less than unity is about 280 + 560κ. For accuracy at the precision level ε we should expect to have κ ≈ 1/ε. Hence, the least acceptable sample size should exceed 5,880. If we hope to make full use of the capabilities of the net, then we should expect to have clusters in almost all of the 2^m cells. If we take this to mean that 2κ = 2^m, then n > 9,240 for m = 5. 
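Both restricted-architecture thresholds follow from solving (number of distinct nets) × e^(−nε²/2) < 1 for n. A small sketch (the function names are ours) reproducing the figures quoted above:

```python
import math

def n_quantized(eps, m, d, k):
    # smallest n making 2^(2 + 2^m + k m (d+1)) * exp(-n eps^2 / 2) < 1
    bits = 2 + 2**m + k * m * (d + 1)
    return math.ceil(2 * bits * math.log(2) / eps**2)

def n_clustered(eps, kappa):
    # smallest n making 2^(2 + 4 kappa) * exp(-n eps^2 / 2) < 1
    bits = 2 + 4 * kappa
    return math.ceil(2 * bits * math.log(2) / eps**2)

eps = 0.1
print(n_quantized(eps, m=5, d=10, k=1))  # close to 4710 + 7625 = 12,335
print(n_clustered(eps, kappa=10))        # kappa ~ 1/eps: close to 5,880
print(n_clustered(eps, kappa=16))        # 2 kappa = 2^m: close to 9,240
```

The small discrepancies against the quoted figures come only from rounding in the linear fits 4710 + 7625k and 280 + 560κ.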
If clusters were equally likely to fall into each of the M cells, then we would require M(log M + α) clusters for the probability of no empty cell to be approximately e^(−e^(−α)) (e.g., Flatto [1982]). Roughly, for m = 5 we should then aim for 2κ = 110 and a sample size exceeding 31,000. Large as this estimate is, it is still a factor of 30 below what a direct application of VC theory yields. \n\nAcknowledgements \n\nI wish to thank Thomas W. Parks for insightful remarks on several of the topics discussed above. \n\nThis paper was prepared with partial support from DARPA through AFOSR-90-0016A. \n\nReferences \n\nBaum, E., D. Haussler [1989], What size net gives valid generalization?, in D. Touretzky, ed., Advances in Neural Information Processing Systems 1, Morgan Kaufmann, 81-90. \n\nDarken, C., J. Moody [1990], Fast adaptive k-means clustering, NIPS90. \n\nDevroye, L., T. Wagner [1979], Distribution-free bounds with the resubstitution error estimate, IEEE Trans. on Information Theory, IT-25, 208-210. \n\nFlatto, L. [1982], Limit theorems for some random variables associated with urn models, Annals of Probability, 10, 927-934. \n\nGuyon, I., P. Albrecht, Y. Le Cun, J. Denker, W. Hubbard [1990], Design of a neural network character recognizer for a touch terminal, listed as to appear in Pattern Recognition, presented orally by Le Cun at the 1990 Keystone Workshop. \n\nHughes, G. [1968], On the mean accuracy of statistical pattern recognizers, IEEE Trans. on Information Theory, 14, 55-63. \n\nPollard, D. [1982], A central limit theorem for k-means clustering, Annals of Probability, 10, 919-926. \n\nPollard, D. [1984], Convergence of Stochastic Processes, Springer-Verlag. \n\nVapnik, V. [1982], Estimation of Dependences Based on Empirical Data, Springer-Verlag. \n", "award": [], "sourceid": 391, "authors": [{"given_name": "Terrence", "family_name": "Fine", "institution": null}]}