{"title": "Task and Spatial Frequency Effects on Face Specialization", "book": "Advances in Neural Information Processing Systems", "page_first": 17, "page_last": 23, "abstract": null, "full_text": "Task and Spatial Frequency Effects on Face Specialization\n\nMatthew N. Dailey, Garrison W. Cottrell\n\nDepartment of Computer Science and Engineering\nU.C. San Diego\nLa Jolla, CA 92093-0114\n{mdailey,gary}@cs.ucsd.edu\n\nAbstract\n\nThere is strong evidence that face processing is localized in the brain. The double dissociation between prosopagnosia, a face recognition deficit occurring after brain damage, and visual object agnosia, difficulty recognizing other kinds of complex objects, indicates that face and non-face object recognition may be served by partially independent mechanisms in the brain. Is neural specialization innate or learned? We suggest that this specialization could be the result of a competitive learning mechanism that, during development, devotes neural resources to the tasks they are best at performing. Further, we suggest that the specialization arises as an interaction between task requirements and developmental constraints. In this paper, we present a feed-forward computational model of visual processing, in which two modules compete to classify input stimuli. When one module receives low spatial frequency information and the other receives high spatial frequency information, and the task is to identify the faces while simply classifying the objects, the low frequency network shows a strong specialization for faces. No other combination of tasks and inputs shows this strong specialization. We take these results as support for the idea that an innately-specified face processing module is unnecessary.
\n\n1  Background\n\nStudies of the preserved and impaired abilities in brain damaged patients provide important clues on how the brain is organized. Cases of prosopagnosia, a face recognition deficit often sparing recognition of non-face objects, and visual object agnosia, an object recognition deficit that can occur without appreciable impairment of face recognition, provide evidence that face recognition is served by a \"special\" mechanism. (For a recent review of this evidence, see Moscovitch, Winocur, and Behrmann (1997).) In this study, we begin to provide a computational account of the double dissociation.\n\nEvidence indicates that face recognition is based primarily on holistic, configural information, whereas non-face object recognition relies more heavily on local features and analysis of the parts of an object (Farah, 1991; Tanaka and Sengco, 1997). For instance, the distance between the tip of the nose and an eye in a face is an important factor in face recognition, but such subtle measurements are rarely as critical for distinguishing, say, two buildings. There is also evidence that configural information is highly relevant when a human becomes an \"expert\" at identifying individuals within other visually homogeneous object classes (Gauthier and Tarr, 1997).\n\nWhat role might configural information play in the development of a specialization for face recognition? de Schonen and Mancini (1995) have proposed that several factors, including different rates of maturation in different areas of cortex, an infant's tendency to track the faces in its environment, and the gradual increase in visual acuity as an infant develops, all combine to force an early specialization for face recognition.
If this scenario is correct, the infant begins to form configural face representations very soon after birth, based primarily on the low spatial frequency information present in face stimuli. Indeed, Costen, Parker, and Craw (1996) showed that although both high-pass and low-pass image filtering decrease face recognition accuracy, high-pass filtering degrades identification accuracy more quickly than low-pass filtering. Furthermore, Schyns and Oliva (1997) have shown that when asked to recognize the identity of the \"face\" in a briefly-presented hybrid image containing a low-pass filtered image of one individual's face and a high-pass filtered image of another individual's face, subjects consistently use the low-frequency component of the image for the task. This work indicates that low spatial frequency information may be more important for face identification than high spatial frequency information.\n\nJacobs and Kosslyn (1994) showed how differential availability of large and small receptive field sizes in a mixture of experts network (Jacobs, Jordan, Nowlan, and Hinton, 1991) can lead to experts that specialize for \"what\" and \"where\" tasks. In previous work, we proposed that a neural mechanism allocating resources according to their ability to perform a given task could explain the apparent specialization for face recognition evidenced by prosopagnosia (Dailey, Cottrell, and Padgett, 1997). We showed that a model based on the mixture of experts architecture, in which a gating network implements competitive learning between two simple homogeneous modules, could develop a specialization such that damage to one module disproportionately impaired face recognition compared to non-face object recognition.
\n\nIn the current study, we consider how the availability of spatial frequency information affects face recognition specialization given this hypothesis of neural resource allocation by competitive learning. We find that when high and low frequency information is \"split\" between the two modules in our system, and the task is to identify the faces while simply classifying the objects, the low-frequency module consistently specializes for face recognition. After describing the study, we discuss its results and their implications.\n\n2  Experimental Methods\n\nWe presented a modular feed-forward neural network with preprocessed images of 12 different faces, 12 different books, 12 different cups, and 12 different soda cans. We gave the network two types of tasks:\n\n1. Learning to recognize the superordinate classes of all four object types (hereafter referred to as classification).\n\n2. Learning to distinguish the individual members of one class (hereafter referred to as identification) while simply classifying objects of the other three types.\n\nFor each task, we investigated the effects of high and low spatial frequency information on identification and classification in a visual processing system with two competing modules. We observed how splitting the range of spatial frequency information between the two modules affected the specializations developed by the network.\n\n2.1  Image Data\n\nWe acquired face images from the Cottrell and Metcalfe facial expression database (1991) and captured multiple images of several books, cups, and soda cans with a CCD camera and video frame grabber. For the face images, we chose five grayscale images of each of 12 individuals.
The images were photographed under controlled lighting and pose conditions; the subjects portrayed a different facial expression in each image. For each of the non-face object classes, we captured five different grayscale images of each of 12 books, 12 cups, and 12 cans. These images were also captured under controlled lighting conditions, with small variations in position and orientation between photos. The entire image set contained 240 images, each of which we cropped and scaled to a size of 64x64 pixels.\n\n2.2  Image Preprocessing\n\nTo convert the raw grayscale images to a biologically plausible representation more suitable for network learning and generalization, and to experiment with the effect of high and low spatial frequency information available in a stimulus, we extracted Gabor jet features from the images at multiple spatial frequency scales, then performed a separate principal components analysis on the data from each filter scale to reduce input pattern dimensionality.\n\n2.2.1  Gabor jet features\n\nThe basic two-dimensional Gabor wavelet resembles a sinusoidal grating restricted by a two-dimensional Gaussian, and may be tuned to a particular orientation and sinusoidal frequency scale. The wavelet can be used to model simple cell receptive fields in cat primary visual cortex (Jones and Palmer, 1987). Buhmann, Lades, and von der Malsburg (1990) describe the Gabor \"jet,\" a vector consisting of filter responses at multiple orientations and scales.\n\nWe convolved each of the 240 images in the input data set with two-dimensional Gabor filters at five scales in eight orientations (0, \u03c0/8, \u03c0/4, 3\u03c0/8, \u03c0/2, 5\u03c0/8, 3\u03c0/4, 7\u03c0/8) and subsampled an 8x8 grid of the responses to each filter. The process resulted in 2560 complex numbers describing each image (5 scales x 8 orientations x 64 grid points).\n\n2.2.2  Principal components analysis\n\nTo reduce the dimensionality of the Gabor jet representation while maintaining a segregation of the responses from each filter scale, we performed a separate PCA on each spatial frequency component of the pattern vector described above. For each of the 5 filter scales in the jet, we extracted the subvectors corresponding to that scale from each pattern in the training set, computed the eigenvectors of their covariance matrix, projected the subvectors from each of the patterns onto these eigenvectors, and retained the eight most significant coefficients. Reassembling the pattern set resulted in 240 40-dimensional vectors (5 scales x 8 coefficients).\n\nFigure 1: Modular network architecture. The gating network units mix the outputs of the hidden layers multiplicatively.\n\n2.3  The Model\n\nThe model is a simple modular feed-forward network inspired by the mixture of experts architecture (Jordan and Jacobs, 1995); however, it contains hidden layers and is trained by backpropagation of error rather than maximum likelihood estimation or expectation maximization. The connections to the output units come from two separate input/hidden layer pairs; these connections are gated multiplicatively by a simple linear network with softmax outputs. Figure 1 illustrates the model's architecture. During training, the network's weights are adjusted by backpropagation of error. The connections from the softmax units in the gating network to the connections between the hidden layers and output layer can be thought of as multiplicative connections with a constant weight of 1.
The resulting learning rules gate the amount of error feedback received by a module according to the gating network's current estimate of its ability to process the current training pattern. Thus the model implements a form of competitive learning in which the gating network learns which module is better able to process a given pattern and rewards the \"winner\" with more error feedback.\n\n2.4  Training Procedure\n\nPreprocessing the images resulted in 240 40-dimensional vectors; four examples of each face and object composed a 192-element training set, and one example of each face and object composed a 48-element test set. We held out one example of each individual in the training set for use in determining when to stop network training. We set the learning rate for all network weights to 0.1 and their momentum to 0.5. Both of the hidden layers contained 15 units in all experiments. For the identification tasks, we determined that a mean squared error (MSE) threshold of 0.02 provided adequate classification performance on the hold out set without overtraining and allowed the gate network to settle to stable values. For the four-way classification task, we found that an MSE threshold of 0.002 was necessary to give the gate network time to stabilize and did not result in overtraining. On all runs reported in the results section, we simply trained the network until it reached the relevant MSE threshold.\n\nFor each of the tasks reported in the results section (four-way classification, book identification, and face identification), we performed two experiments. In the first, as a control, both modules and the gating network were trained and tested with the full 40-dimensional pattern vector.
In the second, the gating network received the full 40-dimensional vector, but module 1 received a vector in which the elements corresponding to the largest two Gabor filter scales were set to 0, and the elements corresponding to the middle filter scale were reduced by 0.5. Module 2, on the other hand, received a vector in which the elements corresponding to the smallest two filter scales were set to 0 and the elements corresponding to the middle filter scale were reduced by 0.5. Thus module 1 received mostly high-frequency information, whereas module 2 received mostly low-frequency information, with deemphasized overlap in the middle range.\n\nFor each of these six experiments, we trained the network using 20 different initial random weight sets and recorded the softmax outputs learned by the gating network on each training pattern.\n\n3  Results\n\nFigure 2 displays the resulting degree of specialization of each module on each stimulus class. Each chart plots the average weight the gating network assigns to each module for the training patterns from each stimulus class, averaged over 20 training runs with different initial random weights. The error bars denote standard error. For each of the three reported tasks (four-way classification, book identification, and face identification), one chart shows division of labor between the two modules in the control situation, in which both modules receive the same patterns, and the other chart shows division of labor between the two modules when one module receives low-frequency information and the other receives high-frequency information.
\n\nWhen required to identify faces on the basis of high- or low-frequency information, compared with the four-way-classification and same-pattern controls, the low-frequency module wins the competition for face patterns extremely consistently (lower right graph). Book identification specialization, however, shows considerably less sensitivity to spatial frequency.\n\nWe have also performed the equivalent experiments with a cup discrimination and a can discrimination task. Both of these tasks show a low-frequency sensitivity lower than that for face identification but higher than that for book identification. Due to space limitations, these results are not presented here.\n\nThe specialized face identification networks also provide good models of prosopagnosia and visual object agnosia: when the face-specialized module's output is \"damaged\" by removing connections from its hidden layer to the output layer, the overall network's generalization performance on face identification drops dramatically, while its generalization performance on object recognition drops much more slowly. When the non-face-specialized (high frequency) module's outputs are damaged, the opposite effect occurs: the overall network's performance on each of the object recognition tasks drops, whereas its performance on face identification remains high.\n\n4  Discussion\n\nThe results in Figure 2 show a strong preference for low-frequency information in the face identification task, empirically demonstrating that, given a choice, a competitive mechanism will choose a module receiving low-frequency, large receptive field information for this task.
This result concurs with the psychological evidence for configural face representations based upon low spatial frequency information, and suggests how the developing brain could be biased toward a specialization for face recognition by the infant's initially low visual acuity.\n\n[Figure 2 panels: Classification (control), Classification (split frequencies), Book id task (control), Book id task (split frequencies), Face id task (control), Face id task (split frequencies). Horizontal axis: Stimulus Type (Faces, Books, Cups, Cans); vertical axis from 0.0 to 1.0; legend: Module 1 and Module 2, labeled high freq and low freq respectively in the split-frequency panels.]\n\nFigure 2: Average weight assigned to each module broken down by stimulus class. For each task, in the control experiment, each module receives the same pattern; the split-frequency charts summarize the specialization resulting when module 1 receives high-frequency Gabor filter information and module 2 receives low-frequency Gabor filter information.\n\nOn the basis of these results, we predict that human subjects performing face and object identification tasks will show more degradation of performance in high-pass filtered images of faces than in high-pass filtered images of other objects. To our knowledge, this has not been empirically tested, although Costen et al. (1996) have investigated the effect of high-pass and low-pass filtering on face images in isolation, and Parker, Lishman, and Hughes (1996) have investigated the effect of high-pass and low-pass filtering of face and object images used as 100 ms cues for a same/different task. Their results indicate that relevant high-pass filtered images cue object processing better than low-pass filtered images, but the two types of filtering cue face processing equally well. Similarly, Schyns and Oliva's (1997) results described earlier suggest that the human face identification network preferentially responds to low spatial frequency inputs.\n\nOur results suggest that simple data-driven competitive learning combined with constraints and biases known or thought to exist during visual system development can account for some of the effects observed in normal and brain-damaged humans. The study lends support to the claim that there is no need for an innately-specified face processing module: face recognition is only \"special\" insofar as faces form a remarkably homogeneous category of stimuli for which within-category discrimination is ecologically beneficial.\n\nReferences\n\nBuhmann, J., Lades, M., and von der Malsburg, C. (1990). Size and distortion invariant object recognition by hierarchical graph matching. In Proceedings of the IJCNN International Joint Conference on Neural Networks, volume II, pages 411-416.\n\nCosten, N., Parker, D., and Craw, I. (1996). Effects of high-pass and low-pass spatial filtering on face identification.
Perception & Psychophysics, 58(4):602-612.\n\nCottrell, G. and Metcalfe, J. (1991). Empath: Face, gender and emotion recognition using holons. In Lippman, R., Moody, J., and Touretzky, D., editors, Advances in Neural Information Processing Systems 3, pages 564-571.\n\nDailey, M., Cottrell, G., and Padgett, C. (1997). A mixture of experts model exhibiting prosopagnosia. In Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society, pages 155-160. Stanford, CA, Mahwah: Lawrence Erlbaum.\n\nde Schonen, S. and Mancini, J. (1995). About functional brain specialization: The development of face recognition. TR 95.1, MRC Cognitive Development Unit, London, UK.\n\nFarah, M. (1991). Patterns of co-occurrence among the associative agnosias: Implications for visual object representation. Cognitive Neuropsychology, 8:1-19.\n\nGauthier, I. and Tarr, M. (1997). Becoming a \"greeble\" expert: Exploring mechanisms for face recognition. Vision Research. In press.\n\nJacobs, R. and Kosslyn, S. (1994). Encoding shape and spatial relations: The role of receptive field size in coordinating complementary representations. Cognitive Science, 18(3):361-386.\n\nJacobs, R., Jordan, M., Nowlan, S., and Hinton, G. (1991). Adaptive mixtures of local experts. Neural Computation, 3:79-87.\n\nJones, J. and Palmer, L. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophys., 58(6):1233-1258.\n\nMoscovitch, M., Winocur, G., and Behrmann, M. (1997). What is special about face recognition? Nineteen experiments on a person with visual object agnosia and dyslexia but normal face recognition. Journal of Cognitive Neuroscience, 9(5):555-604.\n\nParker, D., Lishman, J., and Hughes, J. (1996). Role of coarse and fine spatial information in face and object processing.
Journal of Experimental Psychology: Human Perception and Performance, 22(6):1445-1466.\n\nSchyns, P. and Oliva, A. (1997). Dr. Angry and Mr. Smile: The multiple faces of perceptual categorizations. Submitted for publication.\n\nTanaka, J. and Sengco, J. (1997). Features and their configuration in face recognition. Memory and Cognition. In press.", "award": [], "sourceid": 1399, "authors": [{"given_name": "Matthew", "family_name": "Dailey", "institution": null}, {"given_name": "Garrison", "family_name": "Cottrell", "institution": null}]}