{"title": "Viewpoint Invariant Face Recognition using Independent Component Analysis and Attractor Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 817, "page_last": 823, "abstract": null, "full_text": "Viewpoint invariant face recognition using \n\nindependent component analysis and \n\nattractor networks \n\nMarian Stewart Bartlett \n\nUniversity of California San Diego \n\nThe Salk Institute \nLa Jolla, CA 92037 \n\nmarni@salk.edu \n\nTerrence J. Sejnowski \n\nUniversity of California San Diego \nHoward Hughes Medical Institute \n\nThe Salk Institute, La Jolla, CA 92037 \n\nterry@salk.edu \n\nAbstract \n\nWe have explored two approaches to recognizing faces across \nchanges in pose. First, we developed a representation of face images \nbased on independent component analysis (ICA) and compared it \nto a principal component analysis (PCA) representation for face \nrecognition. The ICA basis vectors for this data set were more \nspatially local than the PCA basis vectors, and the ICA representation had greater invariance to changes in pose. Second, we present \na model for the development of viewpoint invariant responses to \nfaces from visual experience in a biological system. The temporal \ncontinuity of natural visual experience was incorporated into an \nattractor network model by Hebbian learning following a lowpass \ntemporal filter on unit activities. When combined with the temporal filter, a basic Hebbian update rule became a generalization \nof Griniasty et al. (1993), which associates temporally proximal \ninput patterns into basins of attraction. The system acquired representations of faces that were largely independent of pose. 
\n\n1 \n\nIndependent component representations of faces \n\nImportant advances in face recognition have employed forms of principal component analysis, which considers only second-order moments of the input (Cottrell & \nMetcalfe, 1991; Turk & Pentland, 1991). Independent component analysis (ICA) \nis a generalization of principal component analysis (PCA) that decorrelates the \nhigher-order moments of the input (Comon, 1994). In a task such as face recognition, much of the important information is contained in the high-order statistics of \nthe images. A representational basis in which the high-order statistics are decorrelated may be more powerful for face recognition than one in which only the second-order statistics are decorrelated, as in PCA representations. We compared an ICA-based representation to a PCA-based representation for recognizing faces across \nchanges in pose. \n\n\f818 \n\nM. S. Bartlett and T. J. Sejnowski \n\n[Face images of one subject at poses -30°, -15°, 0°, 15°, and 30°] \n\nFigure 1: Examples from image set (Beymer, 1994). \n\nThe image set contained 200 images of faces, consisting of 40 subjects at each of \nfive poses (Figure 1). The images were converted to vectors and comprised the rows \nof a 200 x 3600 data matrix, X. We consider the face images in X to be a linear \nmixture, X = AS, of an unknown set of statistically independent source images S, where A \nis an unknown mixing matrix (Figure 2). The sources are recovered by a matrix of \nlearned filters, W, which produce statistically independent outputs, U = WX. \n\n[Diagram: sources S pass through the unknown mixing process A to give the face images X; the learned weights W produce the separated outputs U] \n\nFigure 2: Image synthesis model. \n\nThe weight matrix, W, was found through an unsupervised learning algorithm that \nmaximizes the mutual information between the input and the output of a nonlinear \ntransformation (Bell & Sejnowski, 1995). 
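This separation principle can be illustrated with a small numerical sketch. The data, learning rate, and iteration count below are illustrative toy choices, not the paper's settings; the sketch applies the natural-gradient form of the Bell & Sejnowski (1995) infomax rule with a logistic nonlinearity to a two-source mixture: \n\n```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the synthesis model in Figure 2: X = A S with sparse sources.
n, T = 2, 5000
S = rng.laplace(size=(n, T))          # super-Gaussian sources
A = rng.normal(size=(n, n))           # unknown mixing matrix
X = A @ S                             # observed mixtures

# Natural-gradient infomax with a logistic nonlinearity g:
#   dW ∝ (I + (1 - 2 g(U)) U^T / T) W,  with U = W X
W = np.eye(n)
for _ in range(1000):
    U = W @ X
    g = 1.0 / (1.0 + np.exp(-U))
    W += 0.01 * (np.eye(n) + (1.0 - 2.0 * g) @ U.T / T) @ W

# On success, W A approaches a scaled permutation: each row of U recovers one
# source, and the rows of W^-1 play the role of the representation matrix A.
P = np.abs(W @ A)
P /= P.max(axis=1, keepdims=True)
print(np.round(P, 2))
```
\n\nBecause the natural-gradient update is equivariant, its convergence does not depend on the conditioning of the mixing matrix A. \n\n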
This algorithm has proven successful for \nseparating randomly mixed auditory signals (the cocktail party problem), and has \nrecently been applied to EEG signals (Makeig et al., 1996) and natural scenes (see \nBell & Sejnowski, this volume). The independent component images contained in \nthe rows of U are shown in Figure 3. In contrast to the principal components, all 200 \nindependent components were spatially local. We took as our face representation \nthe rows of the matrix A = W^-1, which provide the linear combination of source \nimages in U that comprise each face image in X. \n\n1.1 Face Recognition Performance: ICA vs. Eigenfaces \n\nWe compared the performance of the ICA representation to that of the PCA representation for recognizing faces across changes in pose. The PCA representation of a \nface consisted of its component coefficients, which was equivalent to the \"Eigenface\" \n\n\fViewpoint Invariant Face Recognition \n\n819 \n\nFigure 3: Top: Four independent components of the image set. Bottom: First four \nprincipal components. \n\nrepresentation (Turk & Pentland, 1991). A test image was recognized by assigning \nit the label of the nearest of the other 199 images in Euclidean distance. \n\nClassification error rates for the ICA and PCA representations and for the original \ngraylevel images are presented in Table 1. For the PCA representation, the best \nperformance was obtained with the 120 principal components corresponding to the \nhighest eigenvalues. Dropping the first three principal components, or selecting \nranges of intermediate components, did not improve performance. The independent \ncomponent sources were ordered by the magnitude of the weight vector, the corresponding row of W, \nused to extract the source from the image.¹ Best performance was obtained with \nthe 130 independent components with the largest weight vectors. Performance with \nthe ICA representation was significantly superior to Eigenfaces by a paired t-test \n(p < 0.05). 
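This test procedure amounts to leave-one-out nearest-neighbor classification over coefficient vectors. A minimal sketch, using hypothetical two-class data in place of the 200 face coefficient vectors (the function name is ours, not the paper's): \n\n```python
import numpy as np

def nearest_neighbor_accuracy(reps, labels):
    """Leave-one-out recognition: each item takes the label of the nearest
    other item in Euclidean distance, as in the test procedure above."""
    reps = np.asarray(reps, dtype=float)
    d = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)      # a test image may not match itself
    return float(np.mean(labels[d.argmin(axis=1)] == labels))

# Hypothetical coefficient vectors for two well-separated "subjects":
rng = np.random.default_rng(1)
reps = np.vstack([rng.normal(0.0, 0.1, (5, 3)), rng.normal(5.0, 0.1, (5, 3))])
labels = np.array([0] * 5 + [1] * 5)
print(nearest_neighbor_accuracy(reps, labels))  # → 1.0
```
\n\n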
\n\n                    Mutual Information    Percent Correct Recognition \nGraylevel Images          .89                       .83 \nPCA                       .10                       .84 \nICA                       .007                      .87 \n\nTable 1: Mean mutual information between all pairs of 10 basis images, and between the \noriginal graylevel images. Face recognition performance is across all 200 images. \n\nFor the task of recognizing faces across pose, a statistically independent basis set \nprovided a more powerful representation for face images than a principal component \nrepresentation in which only the second-order statistics are decorrelated. \n\n¹The magnitude of the weight vector for optimally projecting the source onto the sloping \npart of the nonlinearity provides a measure of the variance of the original source (Tony \nBell, personal communication). \n\n2 Unsupervised Learning of Viewpoint Invariant Representations of Faces in an Attractor Network \n\nCells in the primate inferior temporal lobe have been reported that respond selectively to faces despite substantial changes in viewpoint (Hasselmo, Rolls, Baylis, & \nNalwa, 1989). Some cells responded independently of viewing angle, whereas other \ncells gave intermediate responses between a viewer-centered and an object-centered \nrepresentation. This section addresses how a system can acquire such invariance to \nviewpoint from visual experience. \n\nDuring natural visual experience, different views of an object or face tend to appear \nin close temporal proximity as an animal manipulates the object or navigates around \nit, or as a face changes pose. Capturing such temporal relationships in the input \nis a way to automatically associate different views of an object without requiring \nthree-dimensional descriptions. \n\n[Diagram: a sequence of pose-varying face images feeds a competitive Hebbian learning stage, which feeds an attractor network] \n\nFigure 4: Model architecture. 
\n\nHebbian learning can capture these temporal relationships in a feedforward system when the output unit activities are passed through a lowpass temporal filter \n(Foldiak, 1991; Wallis & Rolls, 1996). Such lowpass temporal filters have been \nrelated to the time course of the modifiable state of a neuron based on the open \ntime of the NMDA channel for calcium influx (Rhodes, 1992). We show that 1) \nthis lowpass temporal filter increases viewpoint invariance of face representations in \na feedforward system trained with competitive Hebbian learning, and 2) when the \ninput patterns to an attractor network are passed through a lowpass temporal filter, then a basic Hebbian weight update rule associates sequentially proximal input \npatterns into the same basin of attraction. \n\nThis simulation used a subset of 100 images from Section 1, consisting of twenty \nfaces at five poses each. Images were presented to the model in sequential order as \nthe subject changed pose from left to right (Figure 4). The first layer is an energy \nmodel related to the output of V1 complex cells (Heeger, 1991). The images were \nfiltered by a set of sine and cosine Gabor filters at 4 spatial scales and 4 orientations \nat 255 spatial locations. Sine and cosine outputs were squared and summed. The \nset of V1 model outputs projected to a second layer of 70 units, grouped into two \ninhibitory pools. The third stage of the model was an attractor network produced \nby lateral interconnections among all of the complex pattern units. The feedforward \nand lateral connections were trained successively. \n\n2.1 Competitive Hebbian learning of temporal relationships \n\nThe Competitive Learning Algorithm (Rumelhart & Zipser, 1985) was extended to \ninclude a temporal lowpass filter on output unit activities (Bartlett & Sejnowski, \n1996). 
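A toy sketch of this extension, under illustrative settings (layer sizes, λ, learning rate, and input data below are ours, not the paper's): each unit's running-average trace, rather than its instantaneous activity, selects the winner, and the winner's weights move toward the normalized input while continuing to sum to one. \n\n```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_out = 20, 4
lam, alpha = 0.5, 0.1                    # trace and learning-rate settings (illustrative)

W = rng.random((n_out, n_in))
W /= W.sum(axis=1, keepdims=True)        # weights to each unit sum to one

def train_sequence(X, W, trace=None):
    """Competitive learning where the winner is chosen by the lowpass trace
    of the activities, so the previous winner keeps a competitive advantage."""
    trace = np.zeros(W.shape[0]) if trace is None else trace
    for x in X:
        y = W @ x                        # weighted sum of feedforward input
        trace = lam * y + (1 - lam) * trace
        j = trace.argmax()               # winner selected by trace, not raw activity
        W[j] += alpha * (x / x.sum() - W[j])   # move winner toward normalized input
    return W, trace

# Hypothetical "pose sequence": five noisy views derived from one base pattern.
base = rng.random(n_in)
X = np.abs(base + 0.05 * rng.normal(size=(5, n_in)))
W, _ = train_sequence(X, W)
print(np.allclose(W.sum(axis=1), 1.0))  # → True: normalization is preserved
```
\n\nThe update keeps each weight vector on the unit simplex, because the new vector is a convex combination of the old one and the normalized input. \n\n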
This manipulation gives the winner in the previous time steps a competitive \nadvantage for winning, and therefore learning, in the current time step. \n\ny_j^(t) = λ y_j + (1 − λ) y_j^(t-1),     winner = max_j [y_j^(t)]     (1) \n\nThe output activity of unit j at time t, y_j^(t), is determined by the trace, or running \naverage, of its activation, where y_j is the weighted sum of the feedforward inputs. \nThe weights to the winning unit were updated by Δw_ij = α (x_i^u / δ^u − w_ij), where α \nis the learning rate, x_i^u is the value of input unit i for pattern u, and δ^u is the total \namount of input activation for pattern u. The weight to each unit was constrained \nto sum to one. This algorithm was used to train the feedforward connections. There \nwas one face pattern per time step and λ was set to 1 between individuals. \n\n2.2 Lateral connections in the output layer form an attractor network \n\nHebbian learning of lateral interconnections, in combination with a lowpass temporal filter on the unit activities in (1), produces a learning rule that associates \ntemporally proximal inputs into basins of attraction. We begin with a simple Hebbian learning rule \n\nw_ij = (1/N) Σ_{t=1}^{P} (y_i^t − ȳ)(y_j^t − ȳ)     (2) \n\nwhere N is the number of units, P is the number of patterns, and ȳ is the mean \nactivity over all of the units. Replacing y_j^t with the activity trace y_j^(t) defined in \n(1), substituting ȳ = λȳ + (1 − λ)ȳ, and multiplying out the terms leads to the \nfollowing learning rule: \n\nw_ij = (λ²/N) Σ_{t=1}^{P} [ (y_i^t − ȳ)(y_j^t − ȳ) \n+ k1 ((y_i^t − ȳ)(y_j^(t-1) − ȳ) + (y_i^(t-1) − ȳ)(y_j^t − ȳ)) \n+ k2 (y_i^(t-1) − ȳ)(y_j^(t-1) − ȳ) ]     (3) \n\nwhere k1 = λ(1−λ)/λ² and k2 = (1−λ)²/λ². \n\nThe first term in this equation is basic Hebbian learning, the second term associates \npattern t with pattern t − 1, and the third term is Hebbian association of the trace \nactivity for pattern t − 1. This learning rule is a generalization of an attractor \nnetwork learning rule that has been shown to associate random input patterns \ninto basins of attraction based on serial position in the input sequence (Griniasty, \nTsodyks & Amit, 1993). The following update rule was used for the activation V_i \nof unit i at time t from the lateral inputs (Griniasty, Tsodyks, & Amit, 1993): \n\nV_i(t + δt) = Φ [ Σ_j w_ij V_j(t) − θ ] \n\nwhere θ is a neural threshold and Φ(x) = 1 for x > 0, and 0 otherwise. In these \nsimulations, θ = 0.007, N = 70, P = 100, ȳ = 0.03, and λ = 0.5 gave k1 = k2 = 1. \n\n2.3 Results \n\nTemporal association in the feedforward connections broadened the pose tuning of \nthe output units (Figure 5, left). When the lateral connections in the output layer \nwere added, the attractor network acquired responses that were largely invariant to \npose (Figure 5, right). \n\n[Two panels plotting correlation against change in pose (-60° to +60°). Left legend: same face with trace, same face without trace, different faces. Right legend: Hebb plus trace, test set, Griniasty et al., Hebb only.] \n\nFigure 5: Left: Correlation of the outputs of the feedforward system as a function \nof change in pose. Correlations across different views of the same face are compared \nto correlations across different faces, with the temporal trace parameter λ = 0.5 \nand λ = 0. Right: Correlations in sustained activity patterns in the attractor network \nas a function of change in pose. Results obtained with Equation 3 (Hebb plus trace) are \ncompared to Hebb only, and to the learning rule in Griniasty et al. (1993). Test set \nresults for Equation 3 were obtained by alternately training on four poses \nand testing on the fifth, and then averaging across all test cases. 
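The algebraic expansion behind Equation (3) can be checked numerically: since the mean activity ȳ is unchanged by the lowpass filter, the Hebbian outer product of the traces equals the weighted sum of the three terms with k1 = λ(1−λ)/λ² and k2 = (1−λ)²/λ². A sketch with random activities (the dimensions and seed are arbitrary): \n\n```python
import numpy as np

def expansion_holds(lam, n=10, seed=3):
    """Check outer(trace_t - ybar, trace_t - ybar) against the three-term
    rule with k1 = lam(1-lam)/lam^2 and k2 = (1-lam)^2/lam^2."""
    rng = np.random.default_rng(seed)
    y = rng.random(n)                    # current activities y^t
    tr_prev = rng.random(n)              # traces from the previous time step
    ybar = 0.03                          # mean activity; lam*ybar + (1-lam)*ybar = ybar
    tr = lam * y + (1 - lam) * tr_prev   # lowpass trace, as in Equation (1)

    k1 = lam * (1 - lam) / lam**2
    k2 = (1 - lam) ** 2 / lam**2
    a, b = y - ybar, tr_prev - ybar
    lhs = np.outer(tr - ybar, tr - ybar)
    rhs = lam**2 * (np.outer(a, a)
                    + k1 * (np.outer(a, b) + np.outer(b, a))
                    + k2 * np.outer(b, b))
    return bool(np.allclose(lhs, rhs))

print(expansion_holds(0.5), expansion_holds(0.3))  # → True True
```
\n\nAt λ = 0.5 both coefficients reduce to 1, matching the simulation values quoted above. \n\n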
\n\nF     N     F/N     Attractor Network % Correct     ICA % Correct \n5     70    .07              1.00                        .96 \n10    70    .14               .90                        .86 \n20    70    .29               .61                        .89 \n20    160   .13               .73                        .89 \n\nTable 2: Face classification performance of the attractor network for four ratios of the \nnumber of desired memories, F, to the number of units, N. Results are compared to ICA \nfor the same subset of faces. \n\nClassification accuracy of the attractor network was calculated by nearest neighbor \non the activity states (Table 2). Performance of the attractor network depends both \non the performance of the feedforward system, which comprises its input, and on \nthe ratio of the number of patterns to be encoded in memory, F, to the number of \nunits, N, where each individual in the face set comprises one memory pattern. The \nattractor network performed well when this ratio was sufficiently low. The ICA \nrepresentation also performed well, especially for F = 20. \n\nThe goal of this simulation was to begin with structured inputs similar to the responses of V1 complex cells, and to explore the performance of unsupervised learning mechanisms that can transform these inputs into pose invariant responses. We \nshowed that a lowpass temporal filter on unit activities, which has been related \nto the time course of the modifiable state of a neuron (Rhodes, 1992), cooperates \nwith Hebbian learning to (1) increase the viewpoint invariance of responses to faces \nin a feedforward system, and (2) create basins of attraction in an attractor network which associate temporally proximal inputs. These simulations demonstrated \nthat viewpoint invariant representations of complex objects such as faces can be \ndeveloped from visual experience by accessing the temporal structure of the input. 
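The threshold dynamics used in the attractor stage can be sketched directly. For illustration we store a single binary pattern with a plain Hebbian outer product (not the trace rule of Equation 3) and show that a degraded cue settles into the stored basin; N and θ follow the simulation values above, while the pattern and cue are hypothetical: \n\n```python
import numpy as np

def settle(W, V, theta, n_steps=20):
    """Iterate V(t + dt) = Phi[sum_j w_ij V_j(t) - theta], Phi(x) = 1 for x > 0."""
    for _ in range(n_steps):
        V = (W @ V - theta > 0).astype(float)
    return V

N, theta = 70, 0.007                 # values from the simulations above
v = np.zeros(N); v[:10] = 1.0        # one stored binary pattern
W = np.outer(v, v) / N               # simple Hebbian outer-product weights

cue = np.zeros(N); cue[:6] = 1.0     # degraded cue: 6 of the 10 active units
V = settle(W, cue, theta)
print(np.array_equal(V, v))  # → True: the cue falls into the stored basin
```
\n\nWith these weights the stored pattern is a fixed point (active units receive 10/70 > θ of input; inactive units receive −θ < 0), so the cue completes in a single update and then stays there. \n\n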
\nAcknowledgments \nThis project was supported by Lawrence Livermore National Laboratory ISCR Agreement \nB291528, and by the McDonnell-Pew Center for Cognitive Neuroscience at San Diego. \nReferences \nBartlett, M. Stewart, & Sejnowski, T., 1996. Unsupervised learning of invariant representations of faces through temporal association. Computational Neuroscience: Int. Rev. \nNeurobiol. Suppl. 1, J.M. Bower, Ed., Academic Press, San Diego, CA: 317-322. \n\nBell, A., & Sejnowski, T., 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Comp. 7:1129-1159. \n\nBell, A., & Sejnowski, T., 1997. The independent components of natural scenes are edge \nfilters. Advances in Neural Information Processing Systems 9. \n\nBeymer, D. 1994. Face recognition under varying pose. In Proceedings of the 1994 IEEE \nComputer Society Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Comput. Soc. Press: 756-761. \n\nComon, P. 1994. Independent component analysis - a new concept? Signal Processing \n36:287-314. \n\nCottrell, G., & Metcalfe, J., 1991. Face, gender and emotion recognition using Holons. In \nAdvances in Neural Information Processing Systems 3, D. Touretzky, Ed., Morgan Kaufman, San Mateo, CA: 564-571. \n\nFoldiak, P. 1991. Learning invariance from transformation sequences. Neural Comp. 3:194-200. \n\nGriniasty, M., Tsodyks, M., & Amit, D. 1993. Conversion of temporal correlations between \nstimuli to spatial correlations between attractors. Neural Comp. 5:1-17. \n\nHasselmo, M., Rolls, E., Baylis, G., & Nalwa, V. 1989. Object-centered encoding by face-selective neurons in the cortex in the superior temporal sulcus of the monkey. Experimental Brain Research 75(2):417-429. \n\nHeeger, D. 1991. Nonlinear model of neural responses in cat visual cortex. Computational \nModels of Visual Processing, M. Landy & J. Movshon, Eds. 
MIT Press, Cambridge, \nMA. \n\nMakeig, S., Bell, A.J., Jung, T-P., & Sejnowski, T.J. 1996. Independent component analysis of electroencephalographic data. In: Advances in Neural Information Processing \nSystems 8, 145-151. \n\nRhodes, P. 1992. The long open time of the NMDA channel facilitates the self-organization \nof invariant object responses in cortex. Soc. Neurosci. Abst. 18:740. \n\nRumelhart, D. & Zipser, D. 1985. Feature discovery by competitive learning. Cognitive \nScience 9:75-112. \n\nTurk, M., & Pentland, A. 1991. Eigenfaces for Recognition. J. Cog. Neurosci. 3(1):71-86. \n\nWallis, G. & Rolls, E. 1996. A model of invariant object recognition in the visual system. \n\nTechnical Report, Oxford University Department of Experimental Psychology. \n\n\f", "award": [], "sourceid": 1226, "authors": [{"given_name": "Marian", "family_name": "Bartlett", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}