{"title": "Non-Linear Dimensionality Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 580, "page_last": 587, "abstract": null, "full_text": "Non-Linear Dimensionality Reduction \n\nDavid DeMers* & Garrison Cottrell† \nDept. of Computer Science & Engr., 0114 \nInstitute for Neural Computation \nUniversity of California, San Diego \n9500 Gilman Dr. \nLa Jolla, CA 92093-0114 \n\nAbstract \n\nA method for creating a non-linear encoder-decoder for multidimensional data with compact representations is presented. The commonly used technique of autoassociation is extended to allow non-linear representations, and an objective function which penalizes activations of individual hidden units is shown to result in minimum-dimensional encodings with respect to allowable error in reconstruction. \n\n1 INTRODUCTION \n\nReducing the dimensionality of data with minimal information loss is important for feature extraction, compact coding and computational efficiency. The data can be transformed into \"good\" representations for further processing, constraints among feature variables may be identified, and redundancy eliminated. Many algorithms are exponential in the dimensionality of the input, so even reduction by a single dimension may provide valuable computational savings. \n\nAutoassociating feedforward networks with one hidden layer have been shown to extract the principal components of the data (Baldi & Hornik, 1988). Such networks have been used to extract features and develop compact encodings of the data (Cottrell, Munro & Zipser, 1989). 
Principal Components Analysis projects the data into a linear subspace with minimum information loss, by multiplying the data by the eigenvectors of the sample covariance matrix. By examining the magnitude of the corresponding eigenvalues one can estimate the minimum dimensionality of the space into which the data may be projected and estimate the loss. However, if the data lie on a non-linear submanifold of the feature space, then Principal Components will overestimate the dimensionality. For example, the covariance matrix of data sampled from a helix in R^3 will have full rank and thus three principal components. However, the helix is a one-dimensional manifold and can be (smoothly) parameterized with a single number. \n\n*email: demers@cs.ucsd.edu \n†email: gary@cs.ucsd.edu \n\n[Figure 1 here: a non-linear \"principal components\" network (auto-associator), with input, encoding layer, hidden \"bottleneck\" layer, decoding layer and output.] \n\nFigure 1: A network capable of non-linear lower dimensional representations of data. \n\nThe addition of hidden layers between the inputs and the representation layer, and between the representation layer and the outputs provides a network which is capable of learning non-linear representations (Kramer, 1991; Oja, 1991; Usui, Nakauchi & Nakano, 1991). Such networks can perform the non-linear analogue to Principal Components Analysis, and extract \"principal manifolds\". Figure 1 shows the basic structure of such a network. However, the dimensionality of the representation layer is problematic. Ideally, the dimensionality of the encoding (and hence the number of representation units needed) would be determined from the data. \n\nWe propose a pruning method for determining the dimensionality of the representation. 
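The eigenvalue-based dimensionality estimate described above can be sketched in a few lines of numpy. The helper name and the variance threshold are illustrative choices for this sketch, not part of the original method:

```python
import numpy as np

def pca_dim_estimate(X, var_threshold=0.99):
    # Smallest number of principal components whose eigenvalues
    # account for var_threshold of the total variance.
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending order
    frac = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(frac, var_threshold) + 1)

# Helix in R^3: the covariance matrix has full rank, so PCA reports
# three components even though the manifold is one-dimensional.
t = np.linspace(0, 4 * np.pi, 500)
helix = np.stack([np.cos(t), np.sin(t), t / (4 * np.pi)], axis=1)
print(pca_dim_estimate(helix))   # -> 3
```

This makes the helix failure mode concrete: the linear estimate cannot fall below three, while the manifold itself needs only one parameter.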
A greedy algorithm which successively eliminates representation units by penalizing variances results in encodings of minimal dimensionality with respect to the allowable reconstruction error. The algorithm therefore performs non-linear dimensionality reduction (NLDR). \n\n2 DIMENSIONALITY ESTIMATION BY REGULARIZATION \n\nThe a priori assignment of the number of units for the representation layer is problematic. In order to achieve maximum data compression, this number should be as small as possible; however, one also wants to preserve the information in the data and thus encode the data with minimum error. If the intrinsic dimensionality is not known ahead of time (as is typical), some method to estimate the dimensionality is desired. Minimization of the variance of a representation unit will essentially squeeze the variance of the data into the other hidden units. Repeated minimization results in increasingly lower-dimensional representations. \n\nMore formally, let the dimensionality of the raw data be n. We wish to find F and its approximate inverse such that R^n --F--> R^p --F^{-1}--> R^n where p < n. Let y denote the p-dimensional vector whose elements are the p univalued functions f_i which make up F. If one of the component functions f_i is always constant, it is not contributing to the autoassociation and can be eliminated, yielding a function F with p - 1 components. A constant value for f_i means that the variance of f_i over the data is zero. We add a regularization term to the objective function penalizing the variance of one of the representation units. If the variance can be driven to near zero while simultaneously achieving a target error in the primary task of autoassociation, then the unit being penalized can be pruned. 
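A toy numpy sketch of this variance penalty for a single penalized unit follows; it assumes sigmoid activations and, as in the text, treats the mean activation as a constant when differentiating. The function names are illustrative, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def variance_penalty(net, lam):
    # H_p = lam * sum_j (h_p(net^j) - mean_j h_p)^2 for one penalized
    # unit, plus the extra delta added to that unit during backprop.
    h = sigmoid(net)                  # activations over all patterns
    h_bar = h.mean()                  # estimate of E[h_p]
    H = lam * np.sum((h - h_bar) ** 2)
    h_prime = h * (1.0 - h)           # sigmoid derivative
    delta = 2.0 * lam * h_prime * (h - h_bar)   # h_bar held constant
    return H, delta

net = np.array([-1.0, 0.0, 1.0, 2.0])    # net inputs over four patterns
H, delta = variance_penalty(net, lam=0.1)
```

A unit whose activation is constant over the training set incurs zero penalty and zero extra delta, which is exactly the condition under which it can be pruned.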
\nLet H_p = λ_p (Σ_{j=1}^{P} (h_p(net^j) - E(h_p(net^j)))^2), where net^j is the net input to the unit given the jth training pattern, h_p(net^j) is the activation of the pth hidden unit in the representation layer (the one being penalized) and E is the expectation operator. For notational clarity, the pattern superscripts will be suppressed hereafter. E(h_p(net^j)) can be estimated as h̄_p, the mean activation of h_p over all patterns in the training data. \n\n∂H_p/∂w_pl = (∂H_p/∂net_p)(∂net_p/∂w_pl) = 2 λ_p (h_p - h̄_p) h'_p o_l \n\nwhere h'_p is the derivative of the activation function of unit h_p with respect to its input, and o_l is the output of the lth unit in the preceding layer. Let δ_p = 2 λ_p h'_p (h_p - h̄_p). We simply add δ_p to the delta of h_p due to backpropagation from the output layer. \n\nWe first train a multi-layer¹ network to learn the identity map. When error is below a user-specified threshold, λ_i is increased for the unit with lowest variance. If network weights can be found² such that the variance can be reduced below a small threshold while the remaining units are able to encode the data, the hidden unit in question is no longer contributing to the autoencoding, and its connections are excised from the network. The process is repeated until the variance of the unit in question cannot be reduced while maintaining low error. \n\n¹There is no reason to suppose that the encoding and decoding layers must be of the same size. In fact, it may be that two encoding or decoding layers will provide superior performance. For the helix example, the decoder had two hidden layers and linear connections from the representation to the output, while the encoder had a single layer. Kramer (1991) uses information theoretic measures for choosing the size of the encoding and decoding layers; however, only a fixed representation layer and equal encoding and decoding layers are used. 
\n\n²Unbounded weights will allow the same amount of information to pass through the layer with arbitrarily small variance, using arbitrarily large weights. Therefore the weights in the network must be bounded. Weight vectors with magnitudes larger than 10 are renormalized after each epoch. \n\nFigure 2: The original 3-D helix data plus reconstruction from a single parameter encoding. \n\n3 RESULTS \n\nWe applied this method to several problems: \n\n1. a closed 1-D manifold in R^3. \n2. a 1-D helix in R^3. \n3. time series data generated from the Mackey-Glass delay-differential equation. \n4. 160 64-by-64-pixel, 8-bit grayscale face images. \n\nA number of parameter values must be chosen: error threshold, maximum magnitude of weights, value of λ_i when increased, and when to \"give up\" training. For these experiments, they were chosen by hand; however, reasonable values can be selected such that the method can be automated. \n\n3.1 Static Mappings: Circle and Helix \n\nThe first problem is interesting because it is known that there is no diffeomorphism from the circle to the unit interval. Thus (smooth) single-parameter encodings cannot cover the entire circle, though the region of the circle left unparameterized can be made arbitrarily small. Depending on initial conditions, our technique found one of three different solutions. Some simulations resulted in a two-dimensional representation with the encodings lying on a circle in R^2. This is a failure to reduce the dimensionality. The other solutions were both 1-D representations; one \"wrapping\" the unit interval around the circle, the other \"splitting\" the interval into two pieces. The initial architecture consisted of a single 8-unit encoding layer and two 8-unit decoding layers. η was set to 0.01, Δλ to 0.1, and the error threshold, ε, to 0.001. 
\n\nThe helix problem is interesting because the data appears to be three-dimensional to PCA. NLDR consistently finds an invertible one-dimensional representation of the data. Figure 2 shows the original data, along with the network's output when the representation layer was stimulated with activation ranging from 0.1 to 0.9. The training data were mapped into the interval 0.213 - 0.778 using a single (sigmoidal) representation unit. The initial architecture consisted of a single 10-unit encoding layer and two 10-unit decoding layers. η was set to 0.01, Δλ to 0.1, and the error threshold, ε, to 0.001. \n\nFigure 3: Data from the Mackey-Glass delay-differential equation with τ = 17, correlation dimension 2.1, and the reconstructed signal encoded in two and three dimensions. \n\n3.2 NLDR Applied to Time Series \n\nThe Mackey-Glass problem consists of estimation of the intrinsic dimensionality of a scalar signal. Classically, such time series data is embedded in a space of \"high enough\" dimension such that one expects the geometric invariants to be preserved. However, this may significantly overestimate the number of variables needed to describe the data. Two different series were examined; parameter settings for the Mackey-Glass equation were chosen such that the intrinsic dimensionality is 2.1 and 3.5. The data was embedded in a high dimensional space by the standard technique of recoding as vectors of lagged data. A 3-dimensional representation was found for the 2.1-dimensional data and a 4-dimensional representation was found for the 3.5-dimensional data. 
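The standard lagged-vector recoding mentioned above can be sketched as follows; the embedding dimension m and lag tau here are illustrative values, not the settings used in these experiments:

```python
import numpy as np

def delay_embed(x, m, tau=1):
    # Recode a scalar series as vectors (x_t, x_{t+tau}, ..., x_{t+(m-1)tau}).
    n = len(x) - (m - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(m)], axis=1)

x = np.sin(np.linspace(0.0, 20.0, 200))   # stand-in for the Mackey-Glass series
X = delay_embed(x, m=5, tau=3)
print(X.shape)   # -> (188, 5)
```

The embedded vectors are what the autoassociator sees as input; NLDR then estimates how many of the m coordinates are actually needed.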
Figure 3 shows the original data and its reconstruction for the 2.1-dimensional data. Allowing higher reconstruction error resulted in a 3-dimensional representation for the 3.5-dimensional data, effectively smoothing the original signal (DeMers, 1992). Figure 4 shows the original data and its reconstruction for the 3.5-dimensional data. The initial architecture consisted of two 10-unit encoding layers, two 10-unit decoding layers, and a 7-unit representation layer. The representation layer was connected directly to the output layer. η was set to 0.01, Δλ to 0.1, and the error threshold, ε, to 0.001. \n\nFigure 4: Data from the Mackey-Glass delay-differential equation with τ = 35, correlation dimension 3.5, and the reconstructed signal encoded in four dimensions with two different error thresholds. \n\n3.3 Faces \n\nThe face image data is much more challenging. The face data are 64 x 64 pixel, 8-bit grayscale images taken from (Cottrell & Metcalfe, 1991), each of which can be considered to be a point in a 4,096-dimensional \"pixel space\". The question addressed is whether NLDR can find low-dimensional representations of the data which are more useful than principal components. The data was preprocessed by reduction to the first 50 principal components³ of the images. These reduced representations were then processed further by NLDR. The architecture consisted of a 30-unit encoding layer and a 30-unit decoding layer, and an initial representation layer of 20 units. There were direct connections from the representation layer to the output layer. 
η was 0.05, Δλ was 0.1 and ε was 0.001. NLDR found a five-dimensional representation. Figure 5 shows four of the 160 images after reduction to the first 50 principal components (used as training) and the same images after reconstruction from a five-dimensional encoding. We are unable to determine whether the dimensions are meaningful; however, experiments with the decoder show that points inside the convex hull of the representations project to images which look like faces. Figure 6 shows the reconstructed images from a linear interpolation in \"face space\" between the two encodings which are furthest apart. \n\nHow useful are the representations obtained from a training set for identification and classification of other images of the same subjects? The 5-D representations were used to train a feedforward network to recognize the identity and gender of the subjects, as in (Cottrell & Metcalfe, 1991). 120 images were used in training and the remaining 40 used as a test set. The network correctly identified 98% of the training data subjects, and 95% on the test set. The network achieved 95% correct gender recognition on both the training and test sets. The misclassified subject is shown in Figure 7. An informal poll of visitors to the poster in Denver showed that about 2/3 of humans classify the subject as male and 1/3 as female. \n\nAlthough NLDR resulted in five-dimensional encodings of the face data, and thus superficially compresses the data to approximately 55 bits per image or 0.013 bits per pixel, there is no data compression. Both the decoder portion of the network and the eigenvectors used in the initial processing must also be stored. These amortize to about 6 bits per pixel, whereas the original images require only 1.1 bits per pixel under run-length encoding. In order to achieve data compression, a much larger data set must be obtained in order to find the underlying human face manifold. 
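The Figure 6 interpolation can be sketched as below; decoder here is a placeholder standing in for the trained decoding half of the network, not the authors' implementation:

```python
import numpy as np

def interpolate_codes(decoder, code_a, code_b, steps=4):
    # Decode equally spaced interior points on the line segment joining
    # two representation-layer encodings.
    alphas = np.linspace(0.0, 1.0, steps + 2)[1:-1]
    return [decoder((1.0 - a) * code_a + a * code_b) for a in alphas]

decoder = lambda z: z.sum()   # placeholder for the real decoder network
images = interpolate_codes(decoder, np.zeros(5), np.ones(5))
```

With the real decoder, each interpolated 5-D point is projected back to pixel space, producing the morph-like sequence of face images described in the text.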
\n\n³50 was chosen by eyeballing a graph of the eigenvalues for the point at which they began to \"flatten\"; any value between about 40 and 80 would be reasonable. \n\nFigure 5: Four of the original face images and their reconstruction after encoding as five-dimensional data. \n\nFigure 6: The two images with 5-D encodings which are the furthest apart, and the reconstructions of four 5-D points equally spaced along the line joining them. \n\nFigure 7: \"Pat\", the subject whose gender a feedforward network classified incorrectly. \n\n4 CONCLUSIONS \n\nA method for automatically generating a non-linear encoder/decoder for high dimensional data has been presented. The number of representation units in the final network is an estimate of the intrinsic dimensionality of the data. The results are sensitive to the choice of error bound, though the precise relationship is as yet unknown. The size of the encoding and decoding hidden layers must be controlled to avoid over-fitting; any data set can be encoded into scalar values given enough resolution. Since we are using gradient search to solve a global non-linear optimization problem, there is no guarantee that this method will find the global optimum and avoid convergence to local minima. However, NLDR consistently constructed low dimensional encodings which were decodable with low loss. \n\nAcknowledgements \n\nWe would like to thank Matthew Turk & Alex Pentland for making their facerec software available, which was used to extract the eigenvectors of the original face data. The first author was partially supported by Fellowships from the California Space Institute and the McDonnell-Pew Foundation. \n\nReferences \n\nPierre Baldi and Kurt Hornik (1988) \"Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima\", Neural Networks 2, 53-58. 
\n\nGarrison Cottrell and Paul Munro (1988) \"Principal Components Analysis of Images via Backpropagation\", in Proc. SPIE (Cambridge, MA). \n\nGarrison Cottrell, Paul Munro, and David Zipser (1989) \"Image Compression by Backpropagation: A Demonstration of Extensional Programming\", in Sharkey, Noel (ed.), Models of Cognition: A Review of Cognitive Science, vol. 1. \n\nGarrison Cottrell and Janet Metcalfe (1991) \"EMPATH - Face, Emotion and Gender Recognition using Holons\", in Lippmann, R., Moody, J. & Touretzky, D. (eds), Advances in Neural Information Processing Systems 3. \n\nDavid DeMers (1992) \"Dimensionality Reduction for Non-Linear Time Series\", Neural and Stochastic Methods in Image and Signal Processing (SPIE 1766). \n\nMark Kramer (1991) \"Nonlinear Principal Component Analysis Using Autoassociative Neural Networks\", AIChE Journal 37:233-243. \n\nErkki Oja (1991) \"Data Compression, Feature Extraction, and Autoassociation in Feedforward Neural Networks\", in Kohonen, T., Simula, O. and Kangas, J. (eds), Artificial Neural Networks, 737-745. \n\nShiro Usui, Shigeki Nakauchi, and Masae Nakano (1991) \"Internal Color Representation Acquired by a Five-Layer Neural Network\", in Kohonen, T., Simula, O. and Kangas, J. (eds), Artificial Neural Networks, 867-872.", "award": [], "sourceid": 619, "authors": [{"given_name": "David", "family_name": "DeMers", "institution": null}, {"given_name": "Garrison", "family_name": "Cottrell", "institution": null}]}