{"title": "Correlation and Interpolation Networks for Real-time Expression Analysis/Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 909, "page_last": 916, "abstract": null, "full_text": "Correlation and Interpolation Networks for Real-time Expression Analysis/Synthesis

Trevor Darrell, Irfan Essa, Alex Pentland
Perceptual Computing Group
MIT Media Lab

Abstract

We describe a framework for real-time tracking of facial expressions that uses neurally-inspired correlation and interpolation methods. A distributed view-based representation is used to characterize facial state, and is computed using a replicated correlation network. The ensemble response of the set of view correlation scores is input to a network-based interpolation method, which maps perceptual state to motor control states for a simulated 3-D face model. Activation levels of the motor state correspond to muscle activations in an anatomically derived model. By integrating fast and robust 2-D processing with 3-D models, we obtain a system that is able to quickly track and interpret complex facial motions in real-time.

1 INTRODUCTION

An important task for natural and artificial vision systems is the analysis and interpretation of faces. To be useful in interactive systems, and in other settings where the information conveyed is of a time-critical nature, analysis of facial expressions must occur quickly or be of little value. However, many of the traditional computer vision methods for estimating and modeling facial state have proved difficult to perform fast enough for interactive settings. We have therefore investigated neurally inspired mechanisms for the analysis of facial expressions. We use neurally plausible distributed pattern recognition mechanisms to make fast and robust assessments of facial state, and multi-dimensional interpolation networks to connect these measurements to a facial model.
There are many potential applications of a system for facial expression analysis: personalized interfaces which sense a user's emotional state, ultra-low bitrate video conferencing which sends only facial muscle activations, and enhanced recognition systems. We have focused on an application in computer graphics which stresses both the analysis and synthesis components of our system: interactive facial animation.

In the next sections we develop a computational framework for neurally plausible expression analysis, and the connection to a physically-based face model using a radial basis function method. Finally we show the results of these methods applied to the interactive animation task, in which a computer graphics model of a face is rendered in real time and matches the state of the user's face as sensed through a conventional video camera.

2 EXPRESSION MODELING/TRACKING

The modeling and tracking of expressions and faces has been a topic of increasing interest recently. In the neural network field, several successful models of character expression modeling have been developed by Poggio and colleagues. These models apply multi-dimensional interpolation techniques, using the radial basis function method, to the task of interpolating 2-D images of different facial expressions. Librande [4] and Poggio and Brunelli [9] applied the Radial Basis Function (RBF) method to facial expression modeling, using a line drawing representation of cartoon faces. In this model a small set of canonical expressions is defined, and intermediate expressions are constructed via the interpolation technique. The representation used is a generic \"feature vector\", which in the case of cartoon faces consists of the contour endpoints. Recently, Beymer et al.
[1] extended this approach to use real images, relying on optical flow and image warping techniques to solve the correspondence and prediction problems, respectively.

RBF-based techniques have the advantage of allowing efficient and fast computation of intermediate states in a representation. Since the representation is simple and the interpolation computation straightforward, real-time implementations are practical on conventional systems. These methods interpolate between a set of 2-D views, so the need for an explicit 3-D representation is sidestepped. For many applications this is not a problem, and it may even be desirable since it allows extrapolation to \"impossible\" figures or expressions, which may be of creative value. However, for realistic rendering and recognition tasks, the use of a 3-D model may be desirable since it can detect such impossible states.

In the field of computer graphics, much work has been done on the 3-D modeling of faces and facial expression. These models focus on the geometric and physical qualities of facial structure. Platt and Badler [7], Pieper [6], Waters [11] and others have developed models of facial structure, skin dynamics, and muscle connections, respectively, based on available anatomical data. These models provide strong constraints for the tracking of feature locations on a face. Williams [12] developed a method in which explicit feature marks are tracked on a 3-D face by use of two cameras. Terzopoulos and Waters [10] developed a similar method to track linear facial features, estimate corresponding parameters of a three-dimensional wireframe face model, and reproduce facial expression. A significant limitation of these systems is that successful tracking requires facial markings.
Essa and Pentland [3] applied optical flow methods (see also Mase [5]) for the passive tracking of facial motion, and integrated the flow measurement method into a dynamic system model. Their method allowed for completely passive estimation of facial expressions, using all the constraints provided by a full 3-D model of facial expression.

Both the view-based method of Beymer et al. and the 3-D model of Essa and Pentland rely on estimates of optical flow, which are difficult to compute reliably, especially in real-time. Our approach here is to combine interpolated view-based measurements with physically based models, to take advantage of the fast interpolation capability of the RBF and the powerful constraints imposed by physically based models. We construct a framework in which perceptual states are estimated from real video sequences and are interpolated to control the motor control states of a physically based face model.

Figure 1: (a) Frame of video being processed to extract view model. Outlined rectangle indicates area of image used for model. (b) View models found via clustering method on training sequence consisting of neutral, smile, and surprise expressions.

3 VIEW-BASED FACE PERCEPTION

To make reliable real-time measurements of a complex dynamic object, we use a distributed representation corresponding to distinct views of that object. Previously, we demonstrated the use of this type of representation for the tracking and recognition of hand gestures [2]. Like faces, hands are complex objects with both non-rigid and rigid dynamics. Direct use of a 3-D model for recognition has proved difficult for such objects, so we developed a view-based method for representation.
Here we apply this technique to the problem of facial representation, but extend the scheme to connect to a 3-D model for high-level modeling and generation/animation. With this, we gain the representational power and constraints implied by the 3-D model as a high-level representation; however, the 3-D model is only indirectly involved in the perception stage, so we can still have the same speed and reliability afforded by the view-based representation.

In our method each view characterizes a particular aspect or pose of the object being represented. The view is stored iconically; that is, it is a literal image or template (but with some point-wise statistics) of the appearance of the object in that aspect or pose. A match criterion is defined between views and input images; usually a normalized correlation function is used, but other criteria are possible. An input image is represented by the ensemble of match scores from that image to the stored views.

To achieve invariance across a range of transformations, for example translation, rotation and/or scale, units which compute the match score for each view are replicated at different values of each transformation. (In a computer implementation this exhaustive sampling may be impractical due to the number of units needed, in which case this stage may be approximated by hybrid sampling/search methods.) The unit which has maximal response across all values of the transformation is selected, and the ensemble response of the view units which share the same transformation values as the selected unit is stored as the representation for the input image. We set the perceptual state X to be a vector containing this ensemble response.
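As a concrete illustration, the replicated-correlation stage can be sketched as follows. This is a simplified sketch, not the paper's implementation: only translation is sampled (no rotation or scale), the function names and the stride parameter are our own, and all views are assumed to share one window size.

```python
import numpy as np

def normalized_correlation(patch, template):
    """Match criterion between an image patch and a stored view template."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.linalg.norm(p) * np.linalg.norm(t)
    return float(p.ravel() @ t.ravel() / denom) if denom > 0 else 0.0

def perceptual_state(image, views, stride=4):
    """Replicate each view's correlation unit over translations; the
    translation holding the maximally responding unit is selected, and
    the ensemble of all view scores at that translation is the state X."""
    h, w = views[0].shape
    best, X = -np.inf, None
    for r in range(0, image.shape[0] - h + 1, stride):
        for c in range(0, image.shape[1] - w + 1, stride):
            patch = image[r:r + h, c:c + w]
            scores = np.array([normalized_correlation(patch, v) for v in views])
            if scores.max() > best:
                best, X = scores.max(), scores
    return X
```

Because the views are compared at a shared translation, the resulting vector X has one entry per stored view, which is what keeps the downstream interpolation problem low-dimensional.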
If the object to be represented is fully known a priori, then methods to generate views can be constructed by analysis of the aspect graph if the object is polyhedral, or in general by rendering images of the object at evenly spaced rotations. However, in practice good 3-D models that are useful for describing image intensity values are rare, so we look to data-driven methods of acquiring object views.

As described in [2], a simple clustering algorithm can find a set of views that \"span\" a training sequence of images, in the sense that for each image in the sequence at least one view is within some threshold similarity to that image. The algorithm is as follows. Let V be the current set of views for an object (initially one view is specified manually). For each frame I of a training sequence, if at least one v ∈ V has a match value M(v, I) that is greater than a threshold θ, then no action is performed and the next frame is processed. If no view is close, then I is used to construct a new view which is added to the view set. A view v' is created using a window of I centered at the location in the previous image where the closest view was located. (All views usually share the same window size, determined by the initial view.) The view set is then augmented to include the new view: V = V ∪ {v'}. This algorithm will find a set of views which well characterizes an object across the range of poses or expressions contained in the training sequence. For example, in the domain of hand gestures, inputting a training sequence consisting of a waving hand will yield views which contain images of the hand at several different rotations.
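The view-acquisition loop just described can be sketched in a few lines. This is our own simplification: the match window here is the whole frame, rather than a window tracked at the previous best-match location, and the match function is the same zero-mean normalized correlation mentioned above.

```python
import numpy as np

def match(v, frame):
    """Normalized correlation M(v, I) between a stored view and a frame."""
    a = v - v.mean()
    b = frame - frame.mean()
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.ravel() @ b.ravel() / d) if d > 0 else 0.0

def acquire_views(frames, theta=0.7):
    """Clustering of [2]: a frame becomes a new view whenever no stored
    view matches it above threshold theta, so the final set 'spans' the
    training sequence."""
    views = [frames[0]]                  # initial view specified manually
    for frame in frames[1:]:
        if max(match(v, frame) for v in views) < theta:
            views.append(frame)          # V = V ∪ {v'}
    return views
```

Run on a sequence containing near-duplicates of one appearance plus one dissimilar frame, the loop keeps only two views, mirroring how the training sequences below yield one view per distinct expression.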
In the domain of faces, when given a training sequence consisting of a user performing 3 different expressions (neutral, smile, and surprise), this algorithm (with normalized correlation and θ = 0.7) found three views corresponding to these expressions to represent the face, as shown in Figure 1(b). These 3 views serve as a good representation for the face of this user as long as his expression is similar to one in the training set.

The major advantage of this type of distributed view-based representation lies in the reduction of the dimensionality of the processing that needs to occur for recognition, tracking, or control tasks. In the gesture recognition domain, this dimensionality reduction allowed conventional recognition strategies to be applied successfully and in real-time, on examples where it would have been infeasible to evaluate the recognition criteria on the full signal. In the domain explored in this paper it makes the interpolation problem of much lower order: rather than interpolate from thousands of input dimensions, as would be required when the input is the image domain, the view domain for expression modeling tasks typically has on the order of a dozen dimensions.

4 3-D MODELING/MOTOR CONTROL

To model the structure of the face and the dynamics of expression performance, we use the physically based model of Essa et al. This model captures how expressions are generated by muscle actuations and the resulting skin and tissue deformations. The model is capable of controlled nonrigid deformations of various facial regions, in a fashion similar to how humans generate facial expressions by muscle actuations attached to facial tissue. Finite element methods are used to model the dynamics of the system. (A note on 3-D models: while rarely useful for describing image intensity values, they are useful for modeling forces and shape deformations, and are used in that role in the method presented here.)
Figure 2: (a) Face images used as input, (b) normalized correlation scores X(t) for each view model, (c) resulting muscle control parameters Y(t), (d) rendered images of facial model corresponding to muscle parameters.

This model is based on the mesh developed by Platt and Badler [7], extended into a topologically invariant physics-based model through the addition of a dynamic skin and muscle model [6, 11]. These methods give the facial model an anatomically-based facial structure by modeling facial tissue/skin and muscle actuators, with a geometric model to describe force-based deformations and control parameters.

The muscle model provides us with a set of control knobs to drive the facial state, defined to be a vector Y. These serve to define the motor state of the animated face. Our task now is to connect the perceptual states of the observed face to these motor states.

5 CONNECTING PERCEPTION WITH ACTION

We need to establish a mapping from the perceptual view scores to the appropriate muscle activations on the 3-D face model. To do this, we use multidimensional interpolation strategies implemented in network form.

Interpolation requires a set of control points or exemplars from which to derive the desired mapping. Example pairs of real faces and model faces for different expressions are presented to the interpolation method during a training phase. This can be done in one of two ways, with either a user-driven or a model-driven paradigm. In the model-driven case the muscle states are set to generate a particular expression by an animator/programmer and then the user is asked to make the equivalent expression.
The resulting perceptual (view-model) scores are then recorded and paired with the muscle activation levels. In the user-driven case, the user makes an expression of his/her own choosing, and the optical flow method of Essa et al. is used to derive the corresponding muscle activation levels. The model-driven paradigm is simpler and faster, but the user-driven paradigm yields more detailed and authentic facial expressions.

We use the Radial Basis Function (RBF) method presented in [8], and define the interpolated motor controls to be a weighted sum of radial functions centered at each example:

Y = Σ_{i=1}^{n} c_i g(X - X_i)     (1)

where Y are the muscle states, X are the observed view-model scores, X_i are the example scores, g is an RBF (in our case simply a linear ramp, g(ξ) = ||ξ||), and the weights c_i are computed from the example motor values Y_i using the pseudo-inverse method [8].

6 INTERACTIVE ANIMATION SYSTEM

The correlation network, RBF interpolator, and facial model described above have been combined into a single system for interactive animation. The entire system can be updated at over 5 Hz, using a dedicated single-board accelerator to compute the correlation network, and an SGI workstation to render the facial mesh. Here we present two examples of the processing performed by the system, using different strategies for coupling perceptual and motor state.

Figure 2 illustrates one example of real-time facial expression tracking using this system, using a full-coupling paradigm. Across the top, labeled (a), are five frames of a video sequence of a user making a smile expression.
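The fit and evaluation of Eq. (1), with the linear-ramp basis g(ξ) = ||ξ|| and weights obtained by the pseudo-inverse, can be sketched as follows; the function names are ours, and this is a minimal illustration rather than the system's implementation.

```python
import numpy as np

def g(xi):
    """Linear-ramp radial basis used in the paper: g(xi) = ||xi||."""
    return np.linalg.norm(xi)

def fit_rbf(X_ex, Y_ex):
    """Weights c_i from example (perceptual, motor) pairs via the
    pseudo-inverse method of [8]: solve G C = Y with G[j,i] = g(X_j - X_i)."""
    G = np.array([[g(xj - xi) for xi in X_ex] for xj in X_ex])
    return np.linalg.pinv(G) @ np.asarray(Y_ex)   # row i holds c_i

def interpolate(x, X_ex, C):
    """Motor state Y = sum_i c_i g(x - X_i), Eq. (1)."""
    return sum(C[i] * g(x - X_ex[i]) for i in range(len(X_ex)))
```

When the distance matrix G is nonsingular, the interpolant reproduces each example motor vector exactly at its example score vector, and yields smoothly blended muscle activations in between.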
This was one of the expressions used in the training sequence for the view models shown in Figure 1(b), so they were applicable here. Figure 2(b) shows the correlation scores computed for each of the 3 view models for each frame of the sequence. This constituted the perceptual state representation, X(t). In this example the full face is coupled with the full suite of motor control parameters. An RBF interpolator was trained using perceptual/motor state pairs for three example full-face expressions (neutral, smile, surprise); the resulting (interpolated) motor control values, Y(t), for the entire sequence are shown in Figure 2(c). Finally, the rendered facial mesh for five frames of these motor control values is shown in Figure 2(d).

Figure 3: (a) Processing of video frame with independent view model regions for eyes, eyebrows, and mouth region. (b) Overview shot of full system. User is on the left, vision system and camera are on the right, and the animated face is in the center of the scene. The animated face matches the state of the user's face in real-time, including eye-blinks (as is the case in this shot).

When there are only a few canonical expressions that need be tracked/matched, this full-face template approach is robust and simple. However, if the user wishes to exercise independent control of the various regions of the face, then the full-coupling paradigm will be overly restrictive. For example, if the user trains two expressions, eyes closed and eyes open, and then runs the system and attempts to blink only one eye, the rendered face will be unable to match it. (In fact closing one eye leads to the rendered face half-closing both eyes.)

A solution to this is to decouple the regions of the face which are independent geometrically (and, to some degree, in terms of muscle effect).
Under this paradigm, separate correlation networks are computed for each facial region, and a separate RBF interpolation is performed for each. Each interpolator drives a distinct subset of the motor state vector. Figure 3(a) shows the regions used for decoupled local templates. In these examples independent regions were used for each eye, each eyebrow, and the mouth region.

Finally, Figure 3(b) shows a picture of the set-up of the system as it is being run in an interactive setting. The animated face mimics the facial state of the user, matching in real time the position of the eyes, eyelids, eyebrows and mouth of the user. In the example shown in this picture, the user's eyes are closed, so the animated face's eyes are similarly closed. Realistic performance of animated facial expressions and gestures is possible through this method, since the timing and levels of the muscle activations react immediately to changes in the user's face.

7 CONCLUSION

We have explored the use of correlation networks and Radial Basis Function techniques for the tracking of real faces in video sequences. A distributed view-based representation is computed using a network of replicated normalized correlation units, and offers a fast and robust assessment of perceptual state. 3-D constraints on facial shape are achieved through the use of an anatomically derived facial model, whose muscle activations are controlled via interpolated perceptual states using the RBF method.

With this framework we have been able to achieve fast and robust analysis and synthesis of facial expressions. A modeled face mimics the expression of a user in real-time, using only a conventional video camera sensor and no special markings on the face of the user.
This system has promise as a new approach in the interactive animation, video tele-conferencing, and personalized interface domains.

References

[1] D. Beymer, A. Shashua, and T. Poggio. Example Based Image Analysis and Synthesis. MIT AI Lab TR-1431, 1993.

[2] T. Darrell and A. Pentland. Classification of Hand Gestures using a View-Based Distributed Representation. In NIPS-6, 1993.

[3] I. Essa and A. Pentland. A vision system for observing and extracting facial action parameters. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1994.

[4] S. Librande. Example-based Character Drawing. S.M. Thesis, Media Arts and Sciences/Media Lab, MIT, 1992.

[5] K. Mase. Recognition of facial expression from optical flow. IEICE Transactions, Special Issue on Computer Vision and its Applications, E74(10), 1991.

[6] S. Pieper, J. Rosen, and D. Zeltzer. Interactive graphics for plastic surgery: A task level analysis and implementation. Proc. Siggraph-92, pages 127-134, 1992.

[7] S. M. Platt and N. I. Badler. Animating facial expression. ACM SIGGRAPH Conference Proceedings, 15(3):245-252, 1981.

[8] T. Poggio and F. Girosi. A theory of networks for approximation and learning. MIT AI Lab TR-1140, 1989.

[9] T. Poggio and R. Brunelli. A Novel Approach to Graphics. MIT AI Lab TR-1354, 1992.

[10] D. Terzopoulos and K. Waters. Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Trans. PAMI, 15(6):569-579, June 1993.

[11] K. Waters and D. Terzopoulos. Modeling and animating faces using scanned data. The Journal of Visualization and Computer Animation, 2:123-128, 1991.

[12] L. Williams. Performance-driven facial animation. ACM SIGGRAPH Conference Proceedings, 24(4):235-242, 1990.
\n\n\f", "award": [], "sourceid": 999, "authors": [{"given_name": "Trevor", "family_name": "Darrell", "institution": null}, {"given_name": "Irfan", "family_name": "Essa", "institution": null}, {"given_name": "Alex", "family_name": "Pentland", "institution": null}]}