{"title": "A Variational Principle for Model-based Morphing", "book": "Advances in Neural Information Processing Systems", "page_first": 267, "page_last": 273, "abstract": null, "full_text": "A  variational principle for \n\nmodel-based  morphing \n\nLawrence K.  Saul'\"  and  Michael I.  Jordan \n\nCenter for  Biological and Computational Learning \n\nMassachusetts  Institute of Technology \n\n79  Amherst Street,  EI0-034D \n\nCambridge, MA  02139 \n\nAbstract \n\nGiven  a  multidimensional  data  set  and  a  model  of  its  density, \nwe  consider  how  to  define  the  optimal interpolation  between  two \npoints.  This is done by assigning a cost to each path through space, \nbased on two competing goals-one to interpolate through regions \nof high  density,  the other  to minimize arc  length.  From this  path \nfunctional,  we  derive  the  Euler-Lagrange  equations  for  extremal \nmotionj given two points, the desired  interpolation is found by solv(cid:173)\ning a boundary value problem.  We show that this interpolation can \nbe done efficiently, in high dimensions, for Gaussian, Dirichlet, and \nmixture models. \n\n1 \n\nIntroduction \n\nThe  problem  of non-linear  interpolation  arises  frequently  in  image,  speech,  and \nsignal processing.  Consider the following two examples:  (i) given two profiles of the \nsame face,  connect them by a smooth animation of intermediate poses[l]j  (ii) given a \ntelephone signal masked by intermittent noise, fill in the missing speech.  Both these \nexamples may be  viewed  as  instances  of the same abstract  problem.  In  qualitative \nterms,  we  can  state the  problem  as  follows[2]:  given  a  multidimensional data set, \nand  two points from  this set,  find  a  smooth adjoining path that is  consistent  with \navailable models of the  data.  We  will  refer  to  this  as  the  problem  of model-based \nmorphing. 
\n\nIn  this  paper,  we  examine this  problem it  arises  from  statistical  models  of multi(cid:173)\ndimensional data.  Specifically,  our focus  is  on  models that have been  derived from \n\nCurrent  address:  AT&T Labs,  600  Mountain Ave  2D-439,  Murray  Hill,  NJ 07974 \n\n\f268 \n\nL  K.  Saul and M.  I.  Jordan \n\nsome  form  of density  estimation.  Though  there  exists  a  large  body  of work  on \nthe  use  of statistical  models for  regression  and  classification,  there  has  been  com(cid:173)\nparatively little work  on  the other  types  of operations that these  models support. \nNon-linear  morphing is  an example of such  an  operation,  one  that has  important \napplications to video  email[3],  low-bandwidth  teleconferencing[4]'  and  audiovisual \nspeech  recognition [2] . \n\nA common way to describe multidimensional data is some form of mixture modeling. \nMixture models represent the data as a collection of two or more clusters;  thus, they \nare well-suited to handling complicated (multimodal) data sets.  Roughly speaking, \nfor  these  models the problem of interpolation can  be  divided  into two tasks- how \nto interpolate between  points in  the  same cluster,  and how  to interpolate between \npoints in different  clusters.  Our paper will therefore  be organized along these  lines. \n\nPrevious studies of morphing have exploited the properties of radial basis function \nnetworks[l]  and  locally  linear  models[2].  We  have  been  influenced  by  both  these \nworks,  especially  in  the  abstract formulation of the problem.  New  features  of our \napproach include:  the fundamental role played by the density, the treatment of non(cid:173)\nGaussian models, the use  of a continuous variational principle,  and the description \nof the interpolant by  a  differential equation. \n\n2 \n\nIntracluster interpolation \n\nLet Q =  {q(1), q(2), .. . 
, q(|Q|)} denote a set of multidimensional data points, and let P(q) denote a model of the distribution from which these points were generated. Given two points, our problem is to find a smooth adjoining path that respects the statistical model of the data. In particular, the desired interpolant should not pass through regions of space to which the modeled density P(q) assigns low probability. \n\n2.1 Clusters and metrics \n\nTo develop these ideas further, we begin by considering a special class of models, namely, those that represent clusters. We say that P(q) models a data cluster if P(q) has a unique (global) maximum; in turn, we identify the location of this maximum, q^*, as the prototype. \n\nLet us now consider the geometry of the space inhabited by the data. To endow this space with a geometric structure, we must define a metric, g_{\alpha\beta}(q), that provides a measure of the distance between two nearby points: \n\nD[q, q + dq] = \left[ \sum_{\alpha\beta} g_{\alpha\beta}(q)\, dq_\alpha\, dq_\beta \right]^{1/2} + O(|dq|^2).   (1) \n\nIntuitively speaking, the metric should reflect the fact that as one moves away from the center of the cluster, the density of the data dies off more quickly in some directions than in others. A natural choice for the metric, one that meets the above criteria, is the negative Hessian of the log-likelihood: \n\ng_{\alpha\beta}(q) = -\frac{\partial^2 \ln P(q)}{\partial q_\alpha \partial q_\beta}.   (2) \n\nThis metric is positive-definite if ln P(q) is concave; this will be true for all the examples we discuss. \n\n2.2 From densities to paths \n\nThe problem of model-based interpolation is to balance two competing goals: one to interpolate through regions of high density, the other to avoid excessive deformations. Using the metric in eq. (1), we can now assign a cost (or penalty) to each path based on these competing goals. 
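As an illustrative sketch (ours, not the paper's), the metric of eq. (2) can be evaluated numerically for any smooth density by finite differences; for a zero-mean Gaussian with inverse covariance M, the log-density is quadratic and the metric reduces to M everywhere. The function name `metric` and the step size `h` are our own choices.

```python
import numpy as np

def metric(log_density, q, h=1e-3):
    """Sketch of eq. (2): g_ab(q) = -d^2 ln P(q) / dq_a dq_b,
    estimated by central finite differences (an autodiff library
    would be used in practice)."""
    n = len(q)
    g = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            ea, eb = np.eye(n)[a], np.eye(n)[b]
            g[a, b] = -(log_density(q + h*ea + h*eb)
                        - log_density(q + h*ea - h*eb)
                        - log_density(q - h*ea + h*eb)
                        + log_density(q - h*ea - h*eb)) / (4*h*h)
    return g

# For ln P(q) = const - q^T M q / 2, the Hessian is -M,
# so the induced metric should equal M at every point q.
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
log_p = lambda q: -0.5 * q @ M @ q
g = metric(log_p, q=np.array([0.3, -0.7]))
```

Because the log-density is exactly quadratic here, the finite-difference estimate recovers M up to rounding error, consistent with the paper's remark that the Gaussian's induced metric is the inverse covariance matrix.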
\n\nConsider  the  path  parameterized  by  q(t) .  We  begin  by  dividing  the  path  into \nsegments,  each  of which  is  traversed  in  some small time interval,  dt.  We  assign  a \nvalue  to each segment  by \n\n_  {[P(q(t\u00bb]  _l}V[q(t),q(t+dt)] \n\n\u00a2(t) -\n\nP( q*) \n\ne \n\n, \n\n(3) \n\nwhere  f  ~ O.  For  reasons  that  will  become  clear  shortly,  we  refer  to  f  as  the \nline  tension.  The  value  assigned  to  each  segment  dep~n~s on  two  terms:  a  ratio \nof probabilities,  P( q(t\u00bbj P( q*),  which  favors  points  near the  prototype,  and  the \nconstant multiplier, e-l .  Both these terms  are  upper  bo~nded by  unity,  and hence \nso is  their product.  The value of the segment also decays with its length, as a result \nof the exponent, V[q(t), q(t + dt)]. \nWe  derive  a  path functional  by  piecing  these  segments  together,  multiplying their \nindividual  contributions,  and  taking  the  continuum  limit.  A  value  for  the  entire \npath is  obtained from the product: \n\ne-S  = II \u00a2(t). \n\n(4) \n\nTaking the logarithm of both sides,  and considering the limit dt  -+ 0,  we  obtain the \npath functional \n\nS[q( t)1  = J {-In [ ~~!i) 1 +l H ~ go~( q)4 0 <il dt, \n\n1 \n\n(5) \n\nwhere  q ==  it [q]  is  the  tangent  vector  to  the  path  at  time  t.  The  terms  in  this \nfunctional  balance the  two  competing goals for  non-linear  interpolation.  The first \nfavors paths that interpolate through regions of high density, while the second favors \npaths with small arc lengths;  both are  computed under  the metric induced  by  the \nmodeled  density.  The  line  tension  f  determines  the  cost  per  unit  arc  length  and \nmodulates  the  competition  between  the  two  terms.  Note  that  the  value  of  the \nfunctional  does  not depend  on the rate at which  the path is  traversed. \n\nTo  minimize  this  functional,  we  use  the  following  result  from  the  calculus  of \nvariations.  
Let  \u00a3(q, q)  denote  the  integrand  of  eq.  (5),  such  that  S[q(t)]  = \nf dt  \u00a3( q, q).  Then  the  path  which  minimizes  this  functional  obeys  the  Euler(cid:173)\nLagrange equations[5]: \n\n(6) \n\n\f270 \n\nL.  K.  Saul and M. I.  Jordan \n\nWe  define  the model-based interpolant between  two points as the path which mini(cid:173)\nmizes this functional; it is found by solving the associated boundary value problem. \nThe function  C( q, q)  is known  as  the  Lagrangian.  In  the next sections,  we  present \neq. (5) for two distributions of interest-the multivariate Gaussian and the Dirichlet. \n\n2.3  Gaussian cloud \n\nThe simplest model of multidimensional data is  the multivariate Gaussian.  In this \ncase,  the data is modeled by \n\nIM11/2  {I  T \n(27r)N/2  exp  -2\"  [x  Mx] \n\n_ \nP(x) -\n\n} \n\n, \n\n(7) \n\nwhere  M  is  the  inverse  covariance  matrix and  N  is  the  dimensionality.  Without \nloss  of generality,  we nave chosen  the  coordinate  system  so  that  the  mean  of the \ndata coincides with the origin.  For the Gaussian, the mean also defines the location \nof the prototype;  moreover, from  eq.  (2),  the  metric induced  by  this model is just \nthe inverse  covariance matrix.  From eq.  (5),  we  obtain the path functional: \n\n(8) \n\nTo find  a  model-based interpolant, we seek the path that minimizes this functional. \nBecause  the  functional  is  parameterization-invariant,  it suffices  to  consider  paths \nthat are traversed at a constant (unit) rate: xTMx =  1.  From eq.  (6),  we  find  that \nthe optimal path with this parameterization satisfies: \n\n{~ [xTMx]  + f} x + [xTMx] x = x. \n\n(9) \n\nThis is  a  set  of coupled non-linear equations for  the components of x(t).  However, \nnote that at any moment in  time,  the acceleration,  x,  can  be expressed  as  a  linear \ncombination of the position,  x, and the velocity, x.  
It follows that the motion of x lies in a plane; in particular, it lies in the plane spanned by the initial conditions, x and \dot{x}, at time t = 0. This enables one to solve the boundary value problem efficiently, even in very high dimensions. \n\nFigure 1a shows some solutions to this boundary value problem for different values of the line tension, \ell. Note how the paths bend toward the origin, with the degree of curvature determined by the line tension, \ell. \n\n2.4 Dirichlet simplex \n\nFor many types of data, the multivariate Gaussian distribution is not the most appropriate model. Suppose that the data points are vectors of positive numbers whose elements sum to one. In particular, we say that w is a probability vector if w = (w_1, w_2, \ldots, w_N) \in \mathbb{R}^N, w_\alpha > 0 for all \alpha, and \sum_\alpha w_\alpha = 1. Clearly, the multivariate Gaussian is not suited to data of this form, since no matter what the mean and covariance matrix, it cannot assign zero probability to vectors outside the simplex. Instead, a more natural model is the Dirichlet distribution: \n\nP(w) = \Gamma(\theta) \prod_\alpha \frac{w_\alpha^{\theta_\alpha - 1}}{\Gamma(\theta_\alpha)},   (10) \n\nwhere \theta_\alpha > 0 for all \alpha, and \theta \equiv \sum_\alpha \theta_\alpha. Here, \Gamma(\cdot) is the gamma function, and the \theta_\alpha are parameters that determine the statistics of P(w). Note that P(w) = 0 for vectors that are not probability vectors; in particular, the simplex constraints on w are implicit assumptions of the model. \n\nWe can rewrite the Dirichlet distribution in a more revealing form as follows. First, let w^* denote the probability vector with elements w^*_\alpha = \theta_\alpha/\theta. 
Then, making a change of variables from w to ln w, we have: \n\nP(\ln w) = \frac{1}{Z_\theta} \exp\left\{ -\theta \left[ KL(w^* \| w) \right] \right\},   (11) \n\nwhere Z_\theta is a normalization factor that depends on the \theta_\alpha (but not w), and the quantity in the exponent is \theta times the Kullback-Leibler (KL) divergence, \n\nKL(w^* \| w) = \sum_\alpha w^*_\alpha \ln\left[ \frac{w^*_\alpha}{w_\alpha} \right].   (12) \n\nThe KL divergence measures the mismatch between w and w^*, with KL(w^* \| w) = 0 if and only if w = w^*. Since KL(w^* \| w) has no other minima besides the one at w^*, we shall say that P(\ln w) models a data cluster in the variable \ln w. \n\nThe metric induced by this modeled density is computed by following the prescription of eq. (2). For two nearby points inside the simplex, w and w + dw, the result of this prescription is that the squared distance is given by \n\nds^2 = \theta \sum_\alpha \frac{dw_\alpha^2}{w_\alpha}.   (13) \n\nUp to a multiplicative factor of 2\theta, eq. (13) measures the infinitesimal KL divergence between w and w + dw. This is a natural metric for vectors whose elements can be interpreted as probabilities. \n\nThe functional for non-linear interpolation is found by substituting the modeled density and the induced metric into eq. (5). For the Dirichlet distribution, this gives: \n\nS[w(t)] = \int \left\{ \theta \left[ KL(w^* \| w) \right] + \ell \right\} \left[ \theta \sum_\alpha \frac{\dot{w}_\alpha^2}{w_\alpha} \right]^{1/2} dt.   (14) \n\nOur problem is to find the path that minimizes this functional. Because the functional is parameterization-invariant, it again suffices to consider paths that are traversed at a constant rate, or \sum_\alpha \dot{w}_\alpha^2 / w_\alpha = 1. In addition to this, however, we must also enforce the constraint that w remains inside the simplex; this is done by introducing a Lagrange multiplier. Following this procedure, we find that the optimal path is described by: \n\n\left[ \theta\, KL(w^* \| w) + \ell \right] \left\{ \ddot{w}_\alpha - \frac{\dot{w}_\alpha^2}{2 w_\alpha} + \frac{w_\alpha}{2} \right\} - \theta \left[ \sum_\beta \frac{w^*_\beta}{w_\beta}\, \dot{w}_\beta \right] \dot{w}_\alpha = \theta \left( w_\alpha - w^*_\alpha \right). 
(15) \n\nGiven two endpoints, this differential equation defines a boundary value problem for the optimal path. Unlike before, however, in this case the motion of w is not confined to a plane. Hence, the boundary value problem for eq. (15) does not collapse to one dimension, as does its Gaussian counterpart, eq. (9). \n\nTo remedy this situation, we have developed an efficient approximation that finds a near-optimal interpolant, in lieu of the optimal one. This is done in two steps: first, by solving eq. (15) exactly in the limit \ell \to \infty; second, by using this limiting solution, w^\infty(t), to find the lowest-cost path that can be expressed as the convex combination: \n\nw(t) = m(t)\, w^* + [1 - m(t)]\, w^\infty(t).   (16) \n\nThe lowest-cost path of this form is found by substituting eq. (16) into the Dirichlet functional, eq. (14), and solving the Euler-Lagrange equations for m(t). The motivation for eq. (16) is that for finite \ell, we expect the optimal interpolant to deviate from w^\infty(t) and bend toward the prototype at w^*. In practice, this approximation works very well, and by collapsing the boundary value problem to one dimension, it allows cheap computation of the Dirichlet interpolants. Some paths from eq. (16), as well as the \ell \to \infty paths on which they are based, are shown in figure 1b. These paths were computed for the twelve-dimensional simplex (N = 12), then projected onto the w_1 w_2-plane. \n\n3 Intercluster interpolation \n\nThe Gaussian and Dirichlet distributions of the previous section are clearly inadequate for modeling multimodal data sets. In this section, we extend the variational principle to mixture models, which describe the data as a collection of k \geq 2 clusters. 
In particular, suppose the data is modeled by \n\nP(q) = \sum_{z=1}^{k} \pi_z P(q|z).   (17) \n\nHere, we have assumed that the conditional densities P(q|z) model data clusters as defined in section 2.1, and the coefficients \pi_z = P(z) define prior probabilities for the latent variable, z \in \{1, 2, \ldots, k\}. \n\nThe crucial step for mixture models is to develop the appropriate generalization of eq. (5). To this end, let \mathcal{L}_z(q, \dot{q}) denote the Lagrangian derived from the conditional density, P(q|z), and \ell_z the line tension(1) that appears in this Lagrangian. We now combine these Lagrangians into a single functional: \n\nS[q(t), z(t)] = \int dt\, \mathcal{L}_{z(t)}(q, \dot{q}).   (18) \n\nNote that eq. (18) is a functional of two arguments, not one. For mixture models, which define a joint density P(q, z) = \pi_z P(q|z), our goal is to find the optimal path in the joint space q \otimes z. Here, z(t) is a piecewise-constant function of time that assigns a discrete label to each point along the path; in other words, it provides a temporal segmentation of the path, q(t). The purpose of z(t) in eq. (18) is to select which Lagrangian is used to compute the contribution from the interval [t, t + dt]. \n\n(1) To respect the weighting of the mixture components in eq. (17), we set the line tensions according to \ell_z = \ell - \ln \pi_z. Thus, components with higher weights have lower line tensions. \n\nFigure 1: Model-based morphs for (a) Gaussian distribution; (b) Dirichlet distribution; (c) mixture of Gaussians. The prototypes are shown as asterisks; \ell denotes the line tension. 
Figure 1c shows the convergence of the iterative algorithm; n denotes the number of iterations. \n\nAs before, we define the model-based interpolant as the path q(t) that minimizes eq. (18). In this case, however, both q(t) and z(t) must be simultaneously optimized to recover this path. We have implemented an iterative scheme to perform this optimization, one that alternately (i) estimates the segmentation z(t), (ii) computes the model-based interpolant within each cluster based on this segmentation, and (iii) reestimates the points (along the cluster boundaries) where z(t) changes value. In short, the strategy is to optimize z(t) for fixed q(t), then optimize q(t) for fixed z(t). \n\nFigure 1c shows how this algorithm operates on a simple mixture of Gaussians. In this example, the covariance matrices were set equal to the identity matrix, and the means of the Gaussians were distributed along a circle in the x_1 x_2-plane. Note that with each iteration, the interpolant converges more closely to the path that traverses this circle. The effect is similar to the manifold-snake algorithm of Bregler and Omohundro[2]. \n\n4 Discussion \n\nIn this paper we have proposed a variational principle for model-based interpolation. Our framework handles Gaussian, Dirichlet, and mixture models, and the resulting algorithms scale well to high dimensions. Future work will concentrate on the application to real images. \n\nReferences \n\n[1] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. of IEEE, vol. 78:9 (1990). \n\n[2] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning. In G. Tesauro, D. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems 7, 973-980. MIT Press, Cambridge, MA (1995). \n\n[3] T. Ezzat. 
Example based analysis and synthesis for images of faces. MIT EECS M.S. thesis (1996). \n\n[4] D. Beymer, A. Shashua, and T. Poggio. Example based image analysis and synthesis. AI Memo 1161, MIT (1993). \n\n[5] H. Goldstein. Classical Mechanics. Addison-Wesley, London (1980). \n", "award": [], "sourceid": 1283, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}