{"title": "Forward Dynamics Modeling of Speech Motor Control Using Physiological Data", "book": "Advances in Neural Information Processing Systems", "page_first": 191, "page_last": 198, "abstract": null, "full_text": "Forward Dynamics Modeling \n\nof Speech Motor Control \nUsing Physiological Data \n\nMakoto Hirayama, Eric Vatikiotis-Bateson, Mitsuo Kawato \n\nATR Auditory and Visual Perception Research Laboratories \n\n2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, JAPAN \n\nMichael I. Jordan \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nWe propose a paradigm for modeling speech production based on neural \nnetworks. We focus on characteristics of the musculoskeletal system. Using \nreal physiological data - articulator movements and EMG from muscle activity - \na neural network learns the forward dynamics relating motor commands to \nmuscles and the ensuing articulator behavior. After learning, simulated \nperturbations were used to assess properties of the acquired model, such as natural \nfrequency, damping, and interarticulator couplings. Finally, a cascade neural \nnetwork is used to generate continuous motor commands from a sequence of \ndiscrete articulatory targets. \n\n1  INTRODUCTION \nA key problem in the formal study of human language is to understand the process by \nwhich linguistic intentions become speech. Speech production entails extraordinary \ncoordination among diverse neurophysiological and anatomical structures from which \nunfolds through time a complex acoustic signal that conveys to listeners something of \nthe speaker's intention. Analysis of the speech acoustics has not revealed the encoding of \nthese intentions, generally conceived to be ordered strings of some basic unit, e.g., the \nphoneme. 
Nor has analysis of the articulatory system provided an answer, although \nrecent pioneering work by Jordan (1986), Saltzman (1986), Laboissiere (1990) and others \nhas brought us closer to an understanding of the articulatory-to-acoustic transform and has \ndemonstrated the importance of modeling the articulatory system's temporal properties. \nHowever, these efforts have been limited to kinematic modeling because they have not \nhad access to the neuromuscular activity of the articulatory structures. \nIn this study, we are using neural networks to model speech production. The principal \nsteps of this endeavor are shown in Figure 1. In this paper, we focus on characteristics of \nthe musculoskeletal system. Using real physiological data - articulator movements and \nEMG from muscle activity - a neural network learns the forward dynamics relating motor \ncommands to muscles and the ensuing articulator behavior. After learning, a cascade \nneural network model (Kawato, Maeda, Uno, & Suzuki, 1990) is used to generate \ncontinuous motor commands. \n\nIntention to Speak \n\nIntended Phoneme Sequence \n\nGlobal Performance Parameters \n\nTransformation from Phoneme to Gesture \n\nArticulatory Targets \n\nMotor Command Generation \n\nMotor Command \n\nMusculo-Skeletal System \n\nArticulator Trajectories \n\nTransformation from Articulatory Movement to Acoustic Signal \n\nAcoustic Wave Radiation \n\nFigure 1: Forward Model of Speech Production \n\n2  EXPERIMENT \nMovement, EMG, and acoustic data were recorded for one speaker who produced reiterant \nversions of two sentences. 
Speaking rate was fast and the reiterant syllables were ba and bo. \nFigure 2 shows approximate marker positions for tracking positions of the jaw \n(horizontal and vertical) and lips (vertical only) and muscle insertion points for hooked-wire, \nbipolar EMG recording from four muscles: ABD (anterior belly of the digastric) for \njaw lowering, OOI (orbicularis oris inferior) and MTL (mentalis) for lower lip raising and \nprotrusion, and GGA (genioglossus anterior) for tongue tip lowering. \nAll movement and EMG (rectified and integrated) signals were digitized (12 bit) at 200 Hz \nand then numerically smoothed at 40 Hz. Position signals were differentiated to obtain \nvelocity and then, after smoothing at 22 Hz, differentiated again to get acceleration. \nFigure 3 shows data for one reiterant utterance using ba. \n\nArticulator \nUL: upper lip (vertical) \nLL: lower lip (vertical) \nJX: jaw (horizontal) \nJY: jaw (vertical) \n\nMuscle \nABD: anterior belly of the digastric \nOOI: orbicularis oris inferior \nMTL: mentalis \nGGA: genioglossus anterior \n\nFigure 2: Approximate Positions of Markers and Muscle Insertion \n\nfor Recording Movement and EMG \n\n[Figure 3 panels: Audio; Position, Velocity, and Acceleration (UL, LL, JX, JY); EMG (ABD, OOI, MTL, GGA); horizontal axis: Time [s], 0 to 5] \n\nFigure 3: Time Series Representations for All Channels \n\nof One Reiterant Rendition Using ba \n\n3  FORWARD DYNAMICS MODELING OF THE MUSCULO-SKELETAL \nSYSTEM AND TRAJECTORY PREDICTION \nFROM MUSCLE EMG \n\nThe forward dynamics model (FDM) for ba, bo production was obtained using a three-layer \nperceptron with back propagation (Rumelhart, Hinton, & Williams, 1986). The \nnetwork learns the correlations between position, velocity, and EMG at time t and the \nchanges of position and velocity for all articulators at the next time sample t+1. \nAfter learning, the forward dynamics model is connected recurrently as shown in Figure 4. \nThe network uses only the initial articulator position and velocity values and the \ncontinuous EMG \"motor command\" input to generate predicted trajectories. The FDM \nestimates the changes of position and velocity and sums them with position and velocity \nvalues of the previous sample t to obtain estimated values at the next sample t+1. \nFigure 5 compares experimentally observed trajectories with trajectories predicted by this \nnetwork. Spatiotemporal characteristics such as amplitude, frequency, and phase are very \nsimilar and demonstrate the generally good performance of the model. There is, however, \na tendency towards negative offset in the predicted positions. There are two important \nlimitations that reduce the current model's ability to compensate for position shifts in the \ntest utterance. First, there is no specified equilibrium or rest position in articulator space, \ntowards which articulators might tend in the absence of EMG activity. 
Second, the \nacquired FDM is based on limited EMG; at most there is correlated EMG for only one \ndirection of motion per articulator. Addition of antagonist EMG and/or an estimate of \nequilibrium position in articulator or, eventually, task coordinates should increase the \nmodel's generalization capability. \n\n[Figure 4 diagram: Position, Velocity, and EMG are input to the Forward Dynamics Model; its outputs, the changes in Position and Velocity, are summed with the current Position and Velocity and fed back to give the Predicted Trajectory] \n\nFigure 4: Recurrent Network for Trajectory Prediction from Muscle EMG \n\n[Figure 5 panels: Position and Velocity (UL, LL, JX, JY), Network Output vs. Observed Trajectory; horizontal axis: Time [s], 0 to 5] \n\nFigure 5: Experimentally Observed vs. Predicted Trajectories \n\n4  ESTIMATION OF DYNAMIC PARAMETER \nTo investigate quantitative characteristics of the obtained forward dynamics model, the \nmodel system's responses to two types of simulated perturbation were examined. \nThe first simulated perturbation confirmed that the model system indeed learned \nappropriate nonlinear dynamics and affords a rough estimation of its visco-elastic \nproperties, such as natural frequency (1.0 Hz) and damping ratio (0.24). Simulated release \nof the lower lip at various distances from rest revealed underdamped though stable \nbehavior, as shown in Figure 6a. \nThe second perturbation entailed observing articulator response to a step increase (50% of \nfull-scale) in EMG activity for each muscle. Figure 6b demonstrates that the learned \nrelation between EMG input and articulator movement output is dynamical rather than \nkinematic because articulator responses are not instantaneous. 
Learned responses to each \nmuscle's activation also show some interesting and reasonable (though not always correct) \ncouplings between different articulators. \n\n[Figure 6 panels: a. Release of Lower Lip from Rest Position + 0.2 (Position vs. Time [s], 0 to 5); b. Response of Step Increase (+0.5) in EMG (UL, LL, JX, JY responses for ABD, OOI, MTL, GGA vs. Time [s], 0 to 5)] \n\nFigure 6: Visco-Elastic Property of the FDM Observed by Simulated Perturbations \n\n5  MOTOR COMMAND GENERATION USING CASCADE \n\nNEURAL NETWORK MODEL \n\nObserved articulator movements are smooth. Their smoothness is due partly to physical \ndynamic properties (inertia, viscosity). Furthermore, smoothness may be an attribute of \nthe motor command itself, thereby resolving the ill-posed computational problem of \ngenerating continuous motor commands from a small number of discrete articulatory \ntargets. \nTo test this, we incorporated a smoothness constraint on the motor command (rectified \nEMG, in this case), which is conceptually similar to previously proposed constraints on \nchange of torque (Uno, Kawato, & Suzuki, 1989) and muscle-tension (Uno, Suzuki, & \nKawato, 1989). Two articulatory target (via-point) constraints were specified spatially, \none for consonant closure and the other for vowel opening, and assigned to each of the 21 \nconsonant + vowel syllables. The alternating sequence of via-points was isochronous \n(temporally equidistant) except for initial, medial and final pauses. 
The cascade neural \nnetwork (Figure 7) then generated smooth EMG and articulator trajectories whose \nspatiotemporal asymmetry approximated the prosodic patterning of the natural test \nutterances (Figure 8). Although this is only a preliminary implementation of via-point \nand smoothness constraints, the model's ability to generate trajectories of appropriate \nspatiotemporal complexity from a series of alternating via-point inputs is encouraging. \n\n[Figure 7 diagram labels: sequence of articulatory targets; initial gesture position and velocity; FDM units repeated along the time axis; position, velocity, motor command; change in position, change in velocity; realized articulator trajectory; smoothness constraint; generated motor command; musculo-skeletal system] \n\nFigure 7: Cascade Neural Network Model for Motor Command Generation \n\n[Figure 8 panels: articulator trajectories (UL, LL, JX, JY) and EMG (ABD, OOI, MTL, GGA); horizontal axis: Time [s], 0 to 5] \n\nFigure 8: Generated Motor Command (EMG) with Trajectory \n\nTo Satisfy Articulatory Targets \n\n6  CONCLUSION AND FUTURE WORK \nOur intent here has been to provide a preliminary model of speech production based on the \narticulatory system's dynamical properties. We used real physiological data - EMG - \nto obtain the forward dynamics model of the articulators from a multilayer perceptron. \nAfter training, a recurrent network predicted articulator trajectories using the EMG signals \nas the motor command input. Simulated perturbations were used to examine the model \nsystem's response to isolated inputs and to assess its visco-elastic properties and \ninterarticulator couplings. 
Then, we incorporated a reasonable smoothness criterion - \nminimum-motor-command-change - into a cascade neural network that generated \nrealistic trajectories from a bead-like string of via-points. \nWe are now attempting to model various styles of real speech using data from more \nmuscles and articulators such as the tongue. Also, the scope of the model is being \nexpanded to incorporate global performance parameters for motor command generation, \nand the transformations from phoneme to articulatory gesture and from articulatory \nmovement to acoustic signal. \nFinally, a main goal of our work is to develop engineering applications for speech \nsynthesis and recognition. Although our model is still preliminary, we believe resolving \nthe difficulties posed by coarticulation, segmentation, prosody, and speaking style \nultimately depends on understanding physiological and computational aspects of speech \nmotor control. \n\nAcknowledgement \nWe thank Vincent Gracco and Kiyoshi Oshima for muscle insertions; Haskins \nLaboratories for use of their facilities (NIH grant DC-00121); Kiyoshi Honda, Philip \nRubin, Elliot Saltzman and Yoh'ichi Toh'kura for insightful discussion; and Kazunari \nNakane and Eiji Yodogawa for continuous encouragement. Further support was provided \nby HFSP grants to M. Kawato and M. I. Jordan. \n\nReferences \nJordan, M. I. (1986) Serial order: a parallel distributed processing approach, ICS (Institute \n\nfor Cognitive Science, University of California) Report 8604. \n\nKawato, M., Maeda, M., Uno, Y. & Suzuki, R. (1990) Trajectory Formation of Arm \nMovement by Cascade Neural Network Model Based on Minimum Torque-change \nCriterion. Biol. Cybern. 62, 275-288. \n\nLaboissiere, R., Schwarz, J. L. & Bailly, G. (1990) Motor Control for Speech Skills: a \nConnectionist Approach. 
Proceedings of the 1990 Summer School, Morgan Kaufmann \nPublishers, 319-327. \n\nRumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986) Learning Internal Representation \n\nby Error Propagation. Parallel Distributed Processing Chap. 8. MIT Press. \n\nSaltzman, E.L. (1986) Task dynamics coordination of the speech articulators: A \n\npreliminary model. Experimental Brain Research. Series 15. 129-144. \n\nUno, Y., Kawato, M., & Suzuki, R. (1989) Formation and Control of Optimal \n\nTrajectory in Human Multijoint Arm Movement, Biol. Cybern. 61, 89-101. \n\nUno, Y., Suzuki, R. & Kawato, M. (1989) Minimum muscle-tension-change model \nwhich reproduces human arm movement. Proceedings of the 4th symposium on \nBiological and Physiological Engineering, 299-302, in Japanese. \n\n", "award": [], "sourceid": 448, "authors": [{"given_name": "Makoto", "family_name": "Hirayama", "institution": null}, {"given_name": "Eric", "family_name": "Vatikiotis-Bateson", "institution": null}, {"given_name": "Mitsuo", "family_name": "Kawato", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}