{"title": "A Neural Network for Feature Extraction", "book": "Advances in Neural Information Processing Systems", "page_first": 719, "page_last": 726, "abstract": null, "full_text": "A Neural Network for Feature Extraction \n\n719 \n\nA Neural Network for Feature Extraction \n\nNathan Intrator \n\nDiv. of Applied Mathematics, and \n\nCenter for Neural Science \n\nBrown University \n\nProvidence, RI 02912 \n\nABSTRACT \n\nThe paper suggests a statistical framework for the parameter esti(cid:173)\nmation problem associated with unsupervised learning in a neural \nnetwork, leading to an exploratory projection pursuit network that \nperforms feature extraction, or dimensionality reduction. \n\n1 \n\nINTRODUCTION \n\nThe search for a possible presence of some unspecified structure in a high dimen(cid:173)\nsional space can be difficult due to the curse of dimensionality problem, namely \nthe inherent sparsity of high dimensional spaces. Due to this problem, uniformly \naccurate estimations for all smooth functions are not possible in high dimensions \nwith practical sample sizes (Cox, 1984, Barron, 1988). \n\nRecently, exploratory projection pursuit (PP) has been considered (Jones, 1983) as a \npotential method for overcoming the curse of dimensionality problem (Huber, 1985), \nand new algorithms were suggested by Friedman (1987), and by Hall (1988, 1989). \nThe idea is to find low dimensional projections that provide the most revealing \nviews of the full-dimensional data emphasizing the discovery of nonlinear effects \nsuch as clustering. \n\nMany of the methods of classical multivariate analysis turn out to be special cases \nof PP methods. Examples are principal component analysis, factor analysis, and \ndiscriminant analysis. The various PP methods differ by the projection index opti(cid:173)\nmized. \n\n\f720 \n\nIntrator \n\nNeural networks seem promising for feature extraction, or dimensionality reduction, \nmainly because of their powerful parallel computation. 
Feature detecting functions of neurons have been studied in the past two decades (von der Malsburg, 1973; Nass et al., 1973; Cooper et al., 1979; Takeuchi and Amari, 1979). It has also been shown that a simplified neuron model can serve as a principal component analyzer (Oja, 1982).\n\nThis paper suggests a statistical framework for the parameter estimation problem associated with unsupervised learning in a neural network, leading to an exploratory PP network that performs feature extraction, or dimensionality reduction, of the training data set. The formulation, which is similar in nature to PP, is based on a minimization of a cost function over a set of parameters, yielding an optimal decision rule under some norm. First, the formulations of single and multiple feature extraction are presented. Then a new projection index (cost function) is presented that favors directions possessing multimodality, where the multimodality is measured in terms of the separability property of the data. This leads to the synaptic modification equations governing learning in Bienenstock, Cooper, and Munro (BCM) neurons (1982). A network is presented based on the multiple feature extraction formulation, and both the linear and nonlinear neurons are analysed.\n\n2 SINGLE FEATURE EXTRACTION\n\nWe associate a feature with each projection direction. With the addition of a threshold function we can say that an input possesses a feature associated with a direction if its projection onto that direction is larger than the threshold. In these terms, a one dimensional projection is a single feature extraction.\n\nThe approach proceeds as follows: Given a compact set of parameters, define a family of loss functions, where the loss function corresponds to a decision made by the neuron whether to fire or not for a given input. Let the risk be the averaged loss over all inputs. 
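The loss/risk recipe just outlined can be sketched numerically. The sketch below is illustrative only: the toy loss, the sample data, and all names are assumptions of this sketch, not constructs from the paper. For each candidate parameter it takes the pointwise cheaper decision (the optimal decision rule), averages to obtain the risk, and then minimizes the risk over a finite stand-in for the compact parameter set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # fixed input set {x(1), ..., x(n)} in R^N
P = np.full(len(X), 1.0 / len(X))    # input probabilities P(x(i))

def loss(theta, x, a):
    # Toy loss L_theta(x, a): firing (a = 1) is cheap when the projection
    # x . theta is large; silence (a = 0) is cheap when it is small.
    c = x @ theta
    return -c if a == 1 else c

def risk(theta):
    # Averaged loss of the optimal decision rule: for every input take the
    # cheaper of the two decisions, then weight by P.
    return sum(p * min(loss(theta, x, 0), loss(theta, x, 1)) for p, x in zip(P, X))

# Outer minimization over a compact parameter set, here a finite sample of
# unit vectors standing in for the set B_M.
B = rng.normal(size=(50, 5))
B /= np.linalg.norm(B, axis=1, keepdims=True)
theta_hat = min(B, key=risk)
```

The decision rule is never represented explicitly: because the decision space is {0, 1}, the inner minimization reduces to a pointwise `min` over the two loss values.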
Minimize the risk over all possible decision rules, and then minimize the risk over the parameter set. In case the risk does not yield a meaningful minimization problem, or when the parameter set over which the minimization takes place can be restricted by some a priori knowledge, a penalty, i.e. a measure on the parameter set, may be added to the risk.\n\nDefine the decision problem (Ω, F_0, P, L, A), where Ω = {x(1), ..., x(n)}, x(i) ∈ R^N, is a fixed set of input vectors, (Ω, F_0, P) is the corresponding probability space, A = {0, 1} is the decision space, and {L_θ}, θ ∈ B_M, L_θ : Ω × A → R, is the family of loss functions. B_M is a compact set in R^M. Let D be the space of all decision rules. The risk R_θ : D → R is given by:\n\nR_θ(δ) = Σ_{i=1}^{n} P(x(i)) L_θ(x(i), δ(x(i))).\n\n(2.1)\n\nFor a fixed θ, the optimal decision δ_θ is chosen so that:\n\nR_θ(δ_θ) = min_{δ ∈ D} R_θ(δ).\n\n(2.2)\n\nSince the minimization takes place over a finite set, the minimizer exists. In particular, for a given x(i) the decision δ_θ(x(i)) is chosen so that L_θ(x(i), δ_θ(x(i))) < L_θ(x(i), 1 − δ_θ(x(i))).\n\nNow we find an optimal θ̂ that minimizes the risk, namely, θ̂ will be such that:\n\nR_θ̂(δ_θ̂) = min_{θ ∈ B_M} R_θ(δ_θ).\n\n(2.3)\n\nThe minimum with respect to θ exists since B_M is compact.\n\nR_θ(δ_θ) becomes a function that depends only on θ, and when θ represents a vector in R^N, R_θ can be viewed as a projection index.\n\n3 MULTI-DIMENSIONAL FEATURE EXTRACTION\n\nIn this case we have a single layer network of interconnected units, each performing a single feature extraction. All units receive the same input and the interaction between the units is via lateral inhibition. The formulation is similar to single feature extraction, with the addition of interaction between the single feature extractors. Let Q be the number of features to be extracted from the data. The multiple decision rule δ_θ = (δ_θ^(1), ... 
, δ_θ^(Q)) takes values in A = {0, 1}^Q. The risk of node k is given by: R_θ^(k)(δ) = Σ_{i=1}^{n} P(x(i)) L_θ^(k)(x(i), δ^(k)(x(i))), and the total risk of the network is R_θ(δ) = Σ_{k=1}^{Q} R_θ^(k)(δ). Proceeding as before, we can minimize over the decision rules δ to get δ_θ, and then minimize over θ to get θ̂, as in equation (2.3). The coupling of the equations via the inhibition, and the relation between the different features extracted, is exhibited in the loss function for each node and will become clear through the next example.\n\n4 FINDING THE OPTIMAL θ FOR A SPECIFIC LOSS FUNCTION\n\n4.1 A SINGLE BCM NEURON - ONE FEATURE EXTRACTION\n\nIn this section, we present an exploratory PP method with a specific loss function. The differential equations performing the optimization turn out to be a good approximation of the law governing synaptic weight modification in the BCM theory for learning and memory in neurons. The formal presentation of the theory and some theoretical analysis are given in (Bienenstock, 1980; Bienenstock et al., 1982); mean field theory for a network based on these neurons is presented in (Scofield and Cooper, 1985; Cooper and Scofield, 1988); more recent analysis based on the statistical viewpoint is in (Intrator, 1990); computer simulations and the biological relevance are discussed in (Saul et al., 1986; Bear et al., 1987; Cooper et al., 1988).\n\nWe start with a short review of the notations and definitions of BCM theory. Consider a neuron with input vector x = (x_1, ..., x_N), synaptic weights vector m = (m_1, ..., m_N), both in R^N, and activity (in the linear region) c = x · m.\n\nDefine Θ_m = E[(x · m)^2], φ(c, Θ_m) = c^2 − (4/3) c Θ_m, and φ̂(c, Θ_m) = c^2 − (2/3) c Θ_m. The input x, which is a stochastic process, is assumed to be of Type II ϕ-mixing, bounded, and piecewise constant. The ϕ-mixing property specifies the dependency of the future of the process on its past. 
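These definitions admit a small numerical sanity check. In the sketch below the coefficients 4/3 and 2/3 are the ones consistent with the loss and gradient formulas later in this section, and all variable names are mine: the integral of φ̂ from Θ_m to an activity value c equals (c^3 − Θ_m c^2)/3, the closed form appearing in equation (4.2).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))            # inputs x in R^N
m = np.array([1.0, -0.5, 0.25, 0.0])      # synaptic weight vector

c = X @ m                                  # activities c = x . m
theta_m = np.mean(c ** 2)                  # Theta_m = E[(x . m)^2]

def phi(s):      # function driving the modification equation
    return s ** 2 - (4.0 / 3.0) * s * theta_m

def phi_hat(s):  # integrand of the loss function
    return s ** 2 - (2.0 / 3.0) * s * theta_m

# Trapezoidal integral of phi_hat from Theta_m to a sample activity c0,
# compared with the closed form (c0^3 - Theta_m * c0^2) / 3.
c0 = 2.5
s = np.linspace(theta_m, c0, 100001)
numeric = np.sum((phi_hat(s[:-1]) + phi_hat(s[1:])) * np.diff(s)) / 2.0
closed = (c0 ** 3 - theta_m * c0 ** 2) / 3.0
```

The same closed form is obtained when the lower limit is 0 instead of Θ_m, since the integral of φ̂ over [0, Θ_m] vanishes; this is why the fixed-threshold loss below leads to the same risk.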
These assumptions are needed so that the resulting deterministic equation is a good approximation of the stochastic one; they are discussed in detail in (Intrator, 1990). Note that c represents the linear projection of x onto m, and we seek an optimal projection in some sense.\n\nThe BCM synaptic modification equations are given by: ṁ = μ(t) φ(x · m, Θ_m) x, m(0) = m_0, where μ(t) is a global modulator which is assumed to take into account all the global factors affecting the cell, e.g., the beginning or end of the critical period, state of arousal, etc.\n\nRewriting the modification equation as ṁ = μ(t)(x · m)((x · m) − (4/3)Θ_m) x, we see that unlike a classical Hebb-Stent rule, the threshold Θ_m is dynamic. This gives the modification equation the desired stability, with no extra conditions such as saturation of the activity or normalization of ||m||, and also yields a statistically meaningful optimization.\n\nReturning to the statistical formulation, we let θ = m be the parameter to be estimated according to the above formulation, and define an appropriate loss function depending on the cell's decision whether to fire or not. The loss function represents the intuitive idea that the neuron will fire when its activity is greater than some threshold, and will not otherwise. We denote the firing of the neuron by a = 1. Define K = −μ ∫_{Θ_m}^{0} φ̂(s, Θ_m) ds. Consider the following loss function:\n\nL_θ(x, a) = L_m(x, a) =\n  −μ ∫_{Θ_m}^{(x·m)} φ̂(s, Θ_m) ds,      (x · m) > Θ_m, a = 1;\n  K − μ ∫_{Θ_m}^{(x·m)} φ̂(s, Θ_m) ds,   (x · m) < Θ_m, a = 1;\n  −μ ∫_{Θ_m}^{(x·m)} φ̂(s, Θ_m) ds,      (x · m) ≤ Θ_m, a = 0;\n  K − μ ∫_{Θ_m}^{(x·m)} φ̂(s, Θ_m) ds,   (x · m) > Θ_m, a = 0.\n\n(4.1)\n\nIt follows from the definition of L_θ and from the definition of δ_θ in (2.2) that\n\nL_m(x, δ_m) = −μ ∫_{Θ_m}^{(x·m)} φ̂(s, Θ_m) ds = −(μ/3) { (x · m)^3 − E[(x · m)^2] (x · m)^2 }.\n\n(4.2)\n\n
The above definition of the loss function suggests that the decision of a neuron whether to fire or not is based on a dynamic threshold, (x · m) > Θ_m. It turns out that the synaptic modification equations remain the same if the decision is based on a fixed threshold. This is demonstrated by the following loss function, which leads to the same risk as in equation (4.3). Let K = −μ ∫_{0}^{Θ_m} φ̂(s, Θ_m) ds, and\n\nL_θ(x, a) = L_m(x, a) =\n  −μ ∫_{0}^{(x·m)} φ̂(s, Θ_m) ds,      (x · m) ≥ 0, a = 1;\n  K − μ ∫_{0}^{(x·m)} φ̂(s, Θ_m) ds,   (x · m) < 0, a = 1;\n  −μ ∫_{0}^{(x·m)} φ̂(s, Θ_m) ds,      (x · m) ≤ 0, a = 0;\n  K − μ ∫_{0}^{(x·m)} φ̂(s, Θ_m) ds,   (x · m) > 0, a = 0.\n\n(4.1')\n\nThe risk is given by:\n\nR_θ(δ_θ) = −(μ/3) { E[(x · m)^3] − E^2[(x · m)^2] }.\n\n(4.3)\n\nThe following graph represents the φ function and the associated loss function L_m(x, δ_m) as functions of the activity c.\n\n[Figure omitted: THE φ FUNCTION; THE LOSS FUNCTION]\n\nFig. 1: The Function φ and the Loss Function for a Fixed m and Θ_m.\n\nFrom the graph of the loss function it follows that for any fixed m and Θ_m, the loss is small for a given input x when either x · m is close to zero or negative, or when x · m is larger than Θ_m. This suggests that the preferred directions for a fixed Θ_m will be those for which the projected one dimensional distribution differs from normal in the center of the distribution, in the sense that it is multi-modal with a distance between the two peaks larger than Θ_m. Rewriting (4.3) we get\n\nR_θ(δ_θ) = −(μ/3) E^2[(x · m)^2] { E[(x · m)^3] / E^2[(x · m)^2] − 1 }. 
\n(4.4)\n\nThe term E[(x · m)^3] / E^2[(x · m)^2] can be viewed as a measure of the skewness of the distribution, which is a measure of deviation from normality and therefore an interesting direction (Diaconis and Freedman, 1984), in accordance with Friedman's (1987) and Hall's (1988, 1989) argument that it is best to seek projections that differ from the normal in the center of the distribution rather than in the tails.\n\nSince the risk is continuously differentiable, its minimization can be done via the gradient descent method with respect to m, namely:\n\nṁ = −∇_m R_m = μ E[φ(x · m, Θ_m) x].\n\n(4.5)\n\nNotice that the resulting equation represents an averaged deterministic equation of the stochastic BCM modification equations. It turns out that under suitable conditions on the mixing of the input x and the global function μ, equation (4.5) is a good approximation of its stochastic version.\n\nWhen the nonlinearity of the neuron is emphasized, the neuron's activity is defined as c = σ(x · m), where σ usually represents a smooth sigmoidal function. Θ_m is then defined as E[σ^2(x · m)], and the loss function is similar to the one given by equation (4.1) except that (x · m) is replaced by σ(x · m). The gradient of the risk is given by: −∇_m R_m(Θ_m) = μ E[φ(σ(x · m), Θ_m) σ′ x], where σ′ represents the derivative of σ at the point (x · m). Note that σ may represent any nonlinear function, e.g. radially symmetric kernels.\n\n4.2 THE NETWORK - MULTIPLE FEATURE EXTRACTION\n\nIn this case we have Q identical nodes, which receive the same input and inhibit each other. Let the neuronal activity be denoted by c_k = x · m_k. We define the inhibited activity c̃_k = c_k − η Σ_{i≠k} c_i, and the threshold Θ_m^k = E[c̃_k^2]. In a more general case, the inhibition may be defined to take into account the spatial location of adjacent neurons, namely, c̃_k = Σ_i A_{ik} c_i, where A_{ik} represents different types of inhibition, e.g. a Mexican hat. 
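The inhibited activities just defined can be exercised in a small sketch. Everything below (step size, toy data, the explicit-Euler discretization, all names) is an assumption of this sketch rather than code from the paper; each node takes an averaged gradient step of the form given in equation (4.6).

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data whose projection onto the first axis is bimodal: two clusters,
# one centered at the origin and one shifted along the first coordinate.
X = rng.normal(size=(2000, 3))
X[:1000, 0] += 4.0

Q, eta, mu = 2, 0.2, 0.02               # nodes, lateral inhibition, step size
M = 0.05 * rng.normal(size=(Q, 3))      # initial synaptic weights differ per node

def network_step(M, X, eta, mu):
    C = X @ M.T                          # activities c_k = x . m_k, shape (n, Q)
    # Inhibited activities: c_tilde_k = c_k - eta * sum_{i != k} c_i
    C_tilde = (1.0 + eta) * C - eta * C.sum(axis=1, keepdims=True)
    Theta = np.mean(C_tilde ** 2, axis=0)            # thresholds Theta_m^k
    Phi = C_tilde ** 2 - (4.0 / 3.0) * C_tilde * Theta
    grad = (Phi.T @ X) / len(X)                      # E[phi(c_tilde_k, Theta_k) x]
    return M + mu * (1.0 - eta * (M.shape[0] - 1)) * grad

for _ in range(300):
    M = network_step(M, X, eta, mu)
```

With 0 < η(Q − 1) < 1, as here, the updates stay bounded: the dynamic thresholds Θ_m^k grow with the activities and stabilize the iteration without any weight normalization.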
Since the following calculations are valid for both kinds of inhibition, we shall introduce only the simpler one.\n\nThe loss function is similar to the one defined for single feature extraction, with the exception that the activity c = x · m is replaced by c̃. Therefore the risk for node k is given by: R_k = −(μ/3) { E[c̃_k^3] − E^2[c̃_k^2] }, and the total risk is given by R = Σ_{k=1}^{Q} R_k. The gradient of R is given by:\n\n∂R/∂m_k = −μ [1 − η(Q − 1)] E[φ(c̃_k, Θ_m^k) x].\n\n(4.6)\n\nEquation (4.6) demonstrates the ability of the network to perform exploratory projection pursuit in parallel, since the minimization of the risk involves minimization of nodes 1, ..., Q, which are loosely coupled.\n\nThe parameter η represents the amount of lateral inhibition in the network, and is related to the amount of correlation between the different features sought by the network. Experience shows that when η ≈ 0, the different units may all become selective to the simplest feature that can be extracted from the data. When η(Q − 1) ≥ 1, the network becomes selective to those inputs that are very far apart (under the l_2 norm), yielding a classification of a small portion of the data and mostly unresponsiveness to the rest of the data. When 0 < η(Q − 1) < 1, the network becomes responsive to substructures that may be common to several different inputs, namely it extracts invariant features in the data. The optimal value of η has been estimated by data driven techniques.\n\nWhen the nonlinearity of the neuron is emphasized, the activity is defined (as in the single neuron case) as c_k = σ(x · m_k); c̃_k, Θ_m^k, and R_k are defined as before. In this case ∂c̃_k/∂m_i = −η σ′(x · m_i) x for i ≠ k, ∂c̃_k/∂m_k = σ′(x · m_k) x, and equation (4.6) becomes:\n\n∂R/∂m_k = −μ E[( φ(c̃_k, Θ_m^k) − η Σ_{j≠k} φ(c̃_j, Θ_m^j) ) σ′(x · m_k) x].\n\n4.3 OPTIMAL NETWORK SIZE\n\nA major problem in network solutions to real world problems is optimal network size. 
In our case, it is desirable to extract as many features as possible on one hand, but on the other it is clear that too many neurons in the network will simply inhibit each other, yielding sub-optimal results. The following solution was adopted: we replace each neuron in the network with a group of neurons which all receive the same input and the same inhibition from adjacent groups. These neurons differ from one another only in their initial synaptic weights. The output of each neuron is replaced by the average group activity. Experiments show that the resulting network is more robust to noise and outliers in the data. Furthermore, it is observed that groups that become selective to a true feature in the data possess a much smaller inter-group variance of their synaptic weight vectors than those which do not become responsive to a coherent feature. We found that eliminating neurons with large inter-group variance and retraining the network may yield improved feature extraction properties.\n\nThe network has been applied to speech segments, in an attempt to extract features from CV pairs of isolated phonemes (Seebach and Intrator, 1988).\n\n5 DISCUSSION\n\nThe PP method based on the BCM modification function has been found capable of effectively discovering nonlinear data structures in high dimensional spaces. Using a parallel processor and the presented network topology, the pursuit can be done faster than with the traditional serial methods.\n\nThe projection index is based on polynomial moments, and is therefore computationally attractive. When only the nonlinear structure in the data is of interest, a sphering transformation (Huber, 1981; Friedman, 1987) can be applied first to the data for removal of all the location, scale, and correlational structure from the data. 
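The sphering transformation mentioned above is standard whitening. The sketch below is my own illustration, not code from the paper: location is removed by centering, and scale and correlational structure are removed with a symmetric inverse square root of the covariance, leaving identity covariance for the pursuit to work on.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
X = rng.normal(size=(2000, 4)) @ A.T + 5.0       # correlated, shifted data

def sphere(X):
    Xc = X - X.mean(axis=0)                      # remove location
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)             # cov = V diag(vals) V^T
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T    # symmetric inverse square root
    return Xc @ W                                # sphered: identity covariance

Z = sphere(X)
```

After sphering, any remaining structure is necessarily higher-order (e.g. skewness or multimodality), which is exactly what the projection index of Section 4 targets.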
\n\nWhen compared with other PP methods, the highlights of the presented method are: i) the projection index concentrates on directions where the separability property as well as the non-normality of the data is large, thus giving rise to better classification properties; ii) the degree of correlation between the directions, or features, extracted by the network can be regulated via the global inhibition, allowing some tuning of the network to different types of data for optimal results; iii) the pursuit is done on all the directions at once, thus leading to the capability of finding more interesting structures than methods that find only one projection direction at a time; iv) the network's structure suggests a simple method for size optimization.\n\nAcknowledgements\n\nI would like to thank Professor Basilis Gidas for many fruitful discussions.\n\nSupported by the National Science Foundation, the Office of Naval Research, and the Army Research Office.\n\nReferences\n\nBarron A. R. (1988) Approximation of densities by sequences of exponential families. Submitted to Ann. Statist.\nBienenstock E. L. (1980) A theory of the development of neuronal selectivity. Doctoral dissertation, Brown University, Providence, RI.\nBienenstock E. L., L. N Cooper, and P. W. Munro (1982) Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2:32-48.\nBear M. F., L. N Cooper, and F. F. Ebner (1987) A Physiological Basis for a Theory of Synapse Modification. Science 237:42-48.\nCooper L. N, F. Liberman, and E. Oja (1979) A theory for the acquisition and loss of neuron specificity in visual cortex. Biol. Cyb. 33:9-28.\nCooper L. N, and C. L. Scofield (1988) Mean-field theory of a neural network. Proc. Natl. Acad. Sci. USA 85:1973-1977.\nCox D. D. (1984) Multivariate smoothing spline functions. SIAM J. Numer. Anal. 21:789-813.\nDiaconis P., and D. 
Freedman (1984) Asymptotics of Graphical Projection Pursuit. The Annals of Statistics 12:793-815.\nFriedman J. H. (1987) Exploratory Projection Pursuit. Journal of the American Statistical Association 82(397):249-266.\nHall P. (1988) Estimating the Direction in which a Data Set is Most Interesting. Probab. Theory Rel. Fields 80:51-78.\nHall P. (1989) On Polynomial-Based Projection Indices for Exploratory Projection Pursuit. The Annals of Statistics 17:589-605.\nHuber P. J. (1981) Projection Pursuit. Research Report PJH-6, Harvard University, Dept. of Statistics.\nHuber P. J. (1985) Projection Pursuit. The Annals of Statistics 13:435-475.\nIntrator N. (1990) An Averaging Result for Random Differential Equations. In press.\nJones M. C. (1983) The Projection Pursuit Algorithm for Exploratory Data Analysis. Unpublished Ph.D. dissertation, University of Bath, School of Mathematics.\nvon der Malsburg C. (1973) Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14:85-100.\nNass M. M., and L. N Cooper (1975) A theory for the development of feature detecting cells in visual cortex. Biol. Cybernetics 19:1-18.\nOja E. (1982) A Simplified Neuron Model as a Principal Component Analyzer. J. Math. Biology 15:267-273.\nSaul A., and E. E. Clothiaux (1986) Modeling and Simulation III: Simulation of a Model for Development of Visual Cortical Specificity. J. of Electrophysiological Techniques 13:279-306.\nScofield C. L., and L. N Cooper (1985) Development and properties of neural networks. Contemp. Phys. 26:125-145.\nSeebach B. S., and N. Intrator (1988) A Learning Mechanism for the Identification of Acoustic Features. (Society for Neuroscience).\nTakeuchi A., and S. Amari (1979) Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cyb. 35:63-72.\n", "award": [], "sourceid": 244, "authors": [{"given_name": "Nathan", "family_name": "Intrator", "institution": null}]}