{"title": "Rate-coded Restricted Boltzmann Machines for Face Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 908, "page_last": 914, "abstract": null, "full_text": "Rate-coded Restricted Boltzmann Machines for \n\nFace Recognition \n\nVee WhyeTeh \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto M5S 2Z9 Canada \n\nywteh@cs.toronto.edu \n\nGeoffrey E. Hinton \n\nGatsby Computational Neuroscience Unit(cid:173)\n\nUniversity College London \nLondon WCIN 3AR u.K. \nhinton@ gatsby. ucl.ac. uk \n\nAbstract \n\nWe describe a neurally-inspired, unsupervised learning algorithm that \nbuilds a non-linear generative model for pairs of face images from the \nsame individual. Individuals are then recognized by finding the highest \nrelative probability pair among all pairs that consist of a test image and \nan image whose identity is known. Our method compares favorably with \nother methods in the literature. The generative model consists of a single \nlayer of rate-coded, non-linear feature detectors and it has the property \nthat, given a data vector, the true posterior probability distribution over \nthe feature detector activities can be inferred rapidly without iteration or \napproximation. The weights of the feature detectors are learned by com(cid:173)\nparing the correlations of pixel intensities and feature activations in two \nphases: When the network is observing real data and when it is observing \nreconstructions of real data generated from the feature activations. \n\n1 \n\nIntroduction \n\nFace recognition is difficult when the number of individuals is large and the test and training \nimages of an individual differ in expression, pose, lighting or the date on which they were \ntaken. 
In addition to being an important application, face recognition allows us to evaluate different kinds of algorithm for learning to recognize or compare objects, since it requires accurate representation of fine discriminative features in the presence of relatively large within-individual variations. This is made even more difficult when there are very few exemplars of each individual.\n\nWe start by describing a new unsupervised learning algorithm for a restricted form of Boltzmann machine [1]. We then show how to generalize the generative model and the learning algorithm to deal with real-valued pixel intensities and rate-coded feature detectors. We then apply the model to face recognition and compare it to other methods.\n\n2 Inference and learning in Restricted Boltzmann Machines\n\nA Restricted Boltzmann machine (RBM) [2] is a Boltzmann machine with a layer of visible units and a single layer of hidden units, with no hidden-to-hidden or visible-to-visible connections.\n\n*Correspondence address\n\nFigure 1: Alternating Gibbs sampling and the terms in the learning rules of an RBM. (At time 0 the visible units hold the data, at time 1 the one-step reconstruction, and at time ∞ a fantasy drawn from the equilibrium distribution.)\n\nBecause there is no explaining away [3], inference in an RBM is much easier than in a general Boltzmann machine or in a causal belief network with one hidden layer. There is no need to perform any iteration to determine the activities of the hidden units, as the hidden states, s_j, are conditionally independent given the visible states, s_i. The distribution of s_j is given by the standard logistic function:\n\np(s_j = 1 | {s_i}) = 1 / (1 + exp(-Σ_i w_ij s_i))   (1)\n\nConversely, the hidden states of an RBM are marginally dependent, so it is easy for an RBM to learn population codes in which units may be highly correlated. 
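Since Eq. 1 factorizes over the hidden units, inferring their activation probabilities is a single matrix-vector product followed by a logistic sigmoid. A minimal numpy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_probs(v, W):
    # Eq. 1: p(s_j = 1 | v) = sigmoid(sum_i w_ij v_i); the hidden units
    # are conditionally independent given the visible states, so no
    # iteration is needed.
    return 1.0 / (1.0 + np.exp(-(v @ W)))

W = rng.normal(scale=0.01, size=(6, 4))       # 6 visible units, 4 hidden units
v = rng.integers(0, 2, size=6).astype(float)  # a binary visible vector
p = hidden_probs(v, W)                        # one probability per hidden unit
h = (rng.random(4) < p).astype(float)         # sampled binary hidden states
```

The visible units are updated from the hidden units in exactly the same way, using the transpose of the weight matrix.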
Such correlated codes are hard to learn in causal belief networks with one hidden layer because the generative model of a causal belief net assumes marginal independence.\n\nAn RBM can be trained using the standard Boltzmann machine learning algorithm, which follows a noisy but unbiased estimate of the gradient of the log likelihood of the data. One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel with Gibbs sampling. Figure 1 illustrates this process. If this alternating Gibbs sampling is run to equilibrium, there is a very simple way to update the weights so as to minimize the Kullback-Leibler divergence, Q^0 || Q^∞, between the data distribution, Q^0, and the equilibrium distribution of fantasies over the visible units, Q^∞, produced by the RBM [4]:\n\nΔw_ij ∝ <s_i s_j>_{Q^0} − <s_i s_j>_{Q^∞}   (2)\n\nwhere <s_i s_j>_{Q^0} is the expected value of s_i s_j when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and <s_i s_j>_{Q^∞} is the expected value of s_i s_j after prolonged Gibbs sampling.\n\nThis learning rule does not work well because it can take a long time to approach equilibrium and the sampling noise in the estimate of <s_i s_j>_{Q^∞} can swamp the gradient. Hinton [1] shows that it is far more effective to minimize the difference between Q^0 || Q^∞ and Q^1 || Q^∞, where Q^1 is the distribution of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. 
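The one-step reconstruction procedure just described, together with the contrastive divergence update of Hinton [1], can be sketched for a binary RBM as follows (biases omitted for brevity; all names are ours). The weight change is driven by the difference between the pairwise correlations measured on the data and on the reconstruction:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, lr=0.01):
    # Contrastive divergence with one step of alternating Gibbs sampling.
    p_h0 = sigmoid(v0 @ W)                      # hidden probs given the data
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0  # binary hidden sample
    p_v1 = sigmoid(h0 @ W.T)                    # visible probs given hiddens
    v1 = (rng.random(p_v1.shape) < p_v1) * 1.0  # one-step reconstruction
    p_h1 = sigmoid(v1 @ W)                      # hidden probs given reconstruction
    # Correlations under the data minus correlations under the reconstruction
    # (expected values used in place of samples to reduce variance).
    return W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))

W = rng.normal(scale=0.01, size=(6, 4))
v0 = rng.integers(0, 2, size=6).astype(float)
W = cd1_step(v0, W)
```

In practice the two correlation terms would be averaged over a mini-batch of data vectors rather than estimated from a single example.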
The exact gradient of this \"contrastive divergence\" is complicated because \nthe distribution Q1 depends on the weights, but this dependence can safely be ignored to \nyield a simple and effective learning rule for following the approximate gradient of the \ncontrastive divergence: \n\n(3) \n\n3 Applying RBMs to face recognition \n\nFor images of faces, binary pixels are far from ideal. A simple way to increase the represen(cid:173)\ntational power without changing the inference and learning procedures is to imagine that \n\n\feach visible unit, i, has 10 replicas which all have identical weights to the hidden units. So \nfar as the hidden units are concerned, it makes no difference which particular replicas are \nturned on: it is only the number of active replicas that counts. So a pixel can now have 11 \ndifferent intensities. During reconstruction of the image from the hidden activities, all the \nreplicas can share the computation of the probability, Pi, of turning on, and then we can se(cid:173)\nlect n replicas to be on with probability (~)nPi (10 - n)(1-p;). We actually approximated \nthis binomial distribution by just adding a little Gaussian noise to lOpi and rounding. The \nsame trick can be used for the hidden units. Eq. 3 is unaffected except that Si and Sj are \nnow the number of active replicas. \n\nThe replica trick can be seen as a way of simulating a single neuron over a time interval in \nwhich it may produce multiple spikes that constitute a rate-code. For this reason we call the \nmodel \"RBMrate\". We assumed that the visible units can produce up to 10 spikes and the \nhidden units can produce up to 100 spikes. We also made two further approximations: We \nreplaced Si and Sj in Eq. 3 by their expected values and we used the expected value of Si \nwhen computing the probability of activation of the hidden units. 
However, we continued to use the stochastically chosen integer firing rates of the hidden units when computing the one-step reconstructions of the data, so the hidden activities cannot transmit an unbounded amount of information from the data to the reconstruction.\n\nA simple way to use RBMrate for face recognition is to train a single model on the training set, and to identify a face by finding the gallery image that produces a hidden activity vector most similar to the one produced by the test face. This is how eigenfaces are used for recognition, but it does not work well because it does not take into account the fact that some variations across faces are important for recognition, while others are not. To correct this, we instead trained an RBMrate model on pairs of different images of the same individual, and then we used this model of pairs to decide which gallery image is best paired with the test image. To account for the fact that the model likes some individual face images more than others, we define the fit between two faces f_1 and f_2 as G(f_1, f_2) + G(f_2, f_1) − G(f_1, f_1) − G(f_2, f_2), where the goodness score G(v_1, v_2) is the negative free energy of the image pair v_1, v_2 under the model. Weight-sharing is not used, hence G(v_1, v_2) ≠ G(v_2, v_1). However, to preserve symmetry, each pair of images of the same individual v_1, v_2 in the training set has a reversed pair v_2, v_1 in the set. We trained the model with 100 hidden units on 1000 image pairs (500 distinct pairs) for 2000 iterations in batches of 100, with a learning rate of 2.5 × 10^−6 for the weights, a learning rate of 5 × 10^−6 for the biases, and a momentum of 0.95. 
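For an RBM with binary hidden units, the negative free energy that defines the goodness score has a simple closed form, and the fit between two faces follows directly from it. A sketch under that binary-unit assumption (the rate-coded units of RBMrate scale the hidden-unit term; all names here are ours):

```python
import numpy as np

def goodness(v1, v2, W, hid_bias, vis_bias):
    # Negative free energy of an image pair under an RBM with binary
    # hidden units: G(v1, v2) = b.v + sum_j log(1 + exp(x_j)), where
    # x_j is the total input to hidden unit j from both images.
    v = np.concatenate([v1, v2])
    x = v @ W + hid_bias
    return float(v @ vis_bias + np.sum(np.logaddexp(0.0, x)))

def fit(f1, f2, W, hid_bias, vis_bias):
    # Symmetrized score: G(f1,f2) + G(f2,f1) - G(f1,f1) - G(f2,f2).
    g = lambda a, b: goodness(a, b, W, hid_bias, vis_bias)
    return g(f1, f2) + g(f2, f1) - g(f1, f1) - g(f2, f2)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(8, 3))   # 4 pixels per image, 3 hidden units
hb, vb = np.zeros(3), np.zeros(8)
f1, f2 = rng.random(4), rng.random(4)
score = fit(f1, f2, W, hb, vb)
```

Note that the score depends on each image only through the total inputs it sends to the hidden units, which is what makes the gallery-side contributions cacheable.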
\n\nOne advantage of eigenfaces over correlation is that once the test image has been converted into a vector of eigenface activations, comparisons of test and gallery images can be made in the low-dimensional space of eigenface activations rather than the high-dimensional space of pixel intensities. The same applies to our face-pair network, as the goodness score of an image pair is a simple function of the total input received by each hidden unit from each image. The total inputs from each gallery image can be precomputed and stored, while the total inputs from a test image only need to be computed once for comparisons with all gallery images.\n\n4 The FERET database\n\nOur version of the FERET database contained 1002 frontal face images of 429 individuals taken over a period of a few years under varying lighting conditions. Of these images, 818 are used as both the gallery and the training set and the remaining 184 are divided into four disjoint test sets:\n\nThe Δexpression test set contains 110 images of different individuals. These individuals all have another image in the training set that was taken with the same lighting conditions at the same time but with a different expression. The training set also includes a further 244 pairs of images that differ only in expression.\n\nFigure 2: Images are normalized in five stages: a) Original image; b) Locate centers of eyes by hand; c) Rotate image; d) Crop image and subsample at 56 × 56 pixels; e) Mask out all of the background and some of the face, leaving 1768 pixels in an oval shape; f) Equalize the intensity histogram; g) Some examples of processed images.\n\nThe Δdays test set contains 40 images that come from 20 individuals. Each of these individuals has two images from the same session in the training set and two images taken in a session 4 days later or earlier in the test set. 
A further 28 individuals were photographed 4 days apart and all 112 of these images are in the training set.\n\nThe Δmonths test set is just like the Δdays test set except that the time between sessions was at least three months and different lighting conditions were present in the two sessions. This set contains 20 images of 10 individuals. A further 36 images of 9 more individuals were included in the training set.\n\nThe Δglasses test set contains 14 images of 7 different individuals. Each of these individuals has two images in the training set that were taken in another session on the same day. The training and test pairs for an individual differ in that one pair has glasses and the other does not. The training set includes a further 24 images, half with glasses and half without, from 6 more individuals.\n\nThe images include the whole head, parts of the shoulders, and background. Instead of working with whole images, which contain much irrelevant information, we worked with face images that were normalized as shown in figure 2. Masking out all of the background inevitably loses the contour of the face, which contains much discriminative information. The histogram equalization step removes most lighting effects, but it also removes some relevant information, like the skin tone. For the best performance, the contour shape and skin tone would have to be used as additional sources of discriminative information.\n\n5 Comparative results\n\nWe compared RBMrate with four popular face recognition methods. The first and simplest is correlation, which returns the similarity score as the angle between two images represented as vectors of pixel intensities. This performed better than using the Euclidean distance as a score. 
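The correlation baseline amounts to a few lines: the score is the angle between the two pixel-intensity vectors, with a smaller angle meaning a better match. A sketch (names are ours):

```python
import numpy as np

def angle_score(a, b):
    # Similarity as the angle between two images viewed as vectors of
    # pixel intensities; clip guards against rounding pushing the
    # cosine slightly outside [-1, 1].
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Unlike Euclidean distance, the angle is invariant to a global rescaling of the pixel intensities, which may explain why it works better here.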
The second method is eigenfaces [5], which first projects the images onto the principal component subspace, then returns the similarity score as the angle between the projected images. The third method is fisherfaces [6]. Instead of projecting the images onto the subspace of the principal components, which maximizes the variance among the projected images, fisherfaces projects the images onto a subspace which, at the same time, maximizes the between-individual variance and minimizes the within-individual variance in the training set.\n\nFigure 3: Error rates of all methods (correlation, eigenfaces, fisherfaces, δppca and RBMrate) on the Δexpression, Δdays, Δmonths and Δglasses test sets. The bars in each group correspond, from left to right, to the rank-1, rank-2, rank-4, rank-8 and rank-16 error rates. The rank-n error rate is the percentage of test images where the n most similar gallery images are all incorrect.\n\nThe final method, which we shall call δppca, is proposed by Moghaddam et al. [7]. This method models differences between images of the same individual as one PPCA [8, 9], and differences between images of different individuals as another PPCA. Then, given the difference of two images, it returns as the similarity score the likelihood ratio of the difference image under the two PPCA models. It was the best performing algorithm in the September 1996 FERET test [10].\n\nFor eigenfaces, we used 199 principal components, omitting the first principal component, as we determined manually that it encodes simply for lighting conditions. 
This improved the recognition performance on all the test sets except for Δexpression. We used a subspace of dimension 200 for fisherfaces, while we used 10- and 30-dimensional PPCAs for the within-class and between-class models of δppca respectively. These are the same numbers used by Moghaddam et al. and gave the best results in our simulations. The number of dimensions or hidden units used by each method was optimized for that particular method for best performance.\n\nFigure 3 shows the error rates of all five methods on the test sets. The results were averaged over 10 random partitions of the dataset to improve statistical significance. Correlation and eigenfaces perform poorly on Δexpression, probably because they do not attempt to ignore the within-individual variations, whereas the other methods do. All the models did very poorly on the Δmonths test set, which is unfortunate as this is the test set that is most like real applications. RBMrate performed best on Δexpression, fisherfaces is best on Δdays and Δglasses, while eigenfaces is best on Δmonths. These results show that RBMrate is competitive with, but does not perform better than, other methods. Figure 4 shows that after our preprocessing, human observers also have great difficulty with the Δmonths test set, probably because the task is intrinsically difficult and is made even harder by the loss of contour and skin tone information combined with the misleading oval contour produced by masking out all of the background.\n\nFigure 4: On the left is a test image from Δmonths and on the right are the 8 most similar images returned by RBMrate. Most human observers cannot find the correct match within these 8.\n\nFigure 5: Example features learned by RBMrate. Each pair of RFs constitutes a feature. Top half: with unconstrained weights; bottom half: with non-negative weight constraints. 
\n\n6 Receptive fields learned by RBMrate\n\nThe top half of figure 5 shows the weights of a few of the hidden units after training. All the units encode global features, probably because the image normalization ensures that there are strong long-range correlations in pixel intensities. The maximum size of the weights is 0.01765, with most weights having magnitudes smaller than 0.005. Note, however, that the hidden unit activations range from 0 to 100.\n\nOn the left are 4 units exhibiting interesting features and on the right are 4 units chosen at random. The top unit of the first column seems to be encoding the presence of a moustache in both faces. The bottom unit seems to be coding for prominent right eyebrows in both faces. Note that these are facial features which often remain constant across images of the same individual. In the second column are two features which seem to encode different facial expressions in the two faces. The right side of the top unit encodes a smile while the left side is expressionless. This is reversed in the bottom unit. So the network has discovered some features which are fairly constant across images in the same class, and some features which can differ substantially within a class.\n\nInspired by [11], we tried to enforce local features by restricting the weights to be non-negative. This is achieved by resetting negative weights to zero after each weight update. The bottom half of figure 5 shows some of the hidden receptive fields learned. Except for the 4 features on the left, all other features are local and code for features like mouth shape changes (third column) and eyes and cheeks (fourth column). The 4 features on the left are much more global and clearly capture the fact that the direction of the lighting can differ for two images of the same person. 
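The non-negativity constraint described above is a projected gradient step: apply the usual weight update, then clip any weights that went negative back to zero. A one-line sketch (names are ours):

```python
import numpy as np

def nonneg_update(W, dW):
    # Take the gradient step, then reset negative weights to zero,
    # as described in Section 6.
    return np.maximum(W + dW, 0.0)

W = np.array([[0.5, 0.1]])
dW = np.array([[-1.0, 0.2]])
W = nonneg_update(W, dW)
```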
Unfortunately, constraining the weights to be non-negative \nstrongly limits the representational power of RBMrate and makes it worse than all the other \nmethods on all the test sets. \n\n7 Conclusions \n\nWe have introduced a new method for face recognition based on a non-linear generative \nmodel. The generative model can be very complex, yet retains the efficiency required \nfor applications. Performance on the FERET database is comparable to popular methods. \nHowever, unlike other methods based on linear models, there is plenty of room for further \ndevelopment using prior knowledge to constrain the weights or additional layers of hidden \nunits to model the correlations of feature detector activities. These improvements should \ntranslate into improvements in the rate of recognition. \n\nAcknowledgements \n\nWe thank Jonathon Phillips for graciously providing us with the FERET database, the ref(cid:173)\nerees for useful comments and the Gatsby Charitable Foundation for funding. \n\nReferences \n\n[1] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Technical \nReport GeNU TR 2000-004, Gatsby Computational Neuroscience Unit, University College \nLondon, 2000. \n\n[2] P. SmoIensky. Information processing in dynamical systems: Foundations of harmony theory. \nIn D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations \nin the Microstructure of Cognition. Volume 1: Foundations. MIT Press, 1986. \n\n[3] J. Pearl. Probabilistic reasoning in intelligent ~ystems: networks of plausible inference. Morgan \n\nKaufmann Publishers, San Mateo CA, 1988. \n\n[4] G. E. Hinton and T. J. Sejnowski. Learning and relearning in boltzmann machines. In D. E. \nRumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the \nMicrostructure of Cognition. Volume 1: Foundations. MIT Press, 1986. \n\n[5] M. Turk and A. Pentland. Eigenfaces for recognition. 
Journal of Cognitive Neuroscience, 3(1):71-86, 1991.\n\n[6] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces versus fisherfaces: recognition using class specific linear projection. In European Conference on Computer Vision, 1996.\n\n[7] B. Moghaddam, W. Wahid, and A. Pentland. Beyond eigenfaces: probabilistic matching for face recognition. In IEEE International Conference on Automatic Face and Gesture Recognition, 1998.\n\n[8] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696-710, 1997.\n\n[9] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Research Group, Aston University, 1997.\n\n[10] P. J. Phillips, H. Moon, P. Rauss, and S. A. Rizvi. The FERET September 1996 database and evaluation procedure. In International Conference on Audio- and Video-based Biometric Person Authentication, 1997.\n\n[11] D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401, October 1999.\n", "award": [], "sourceid": 1886, "authors": [{"given_name": "Yee Whye", "family_name": "Teh", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}