{"title": "Adaptation in Speech Motor Control", "book": "Advances in Neural Information Processing Systems", "page_first": 38, "page_last": 44, "abstract": "", "full_text": "Adaptation in Speech Motor Control \n\nJohn F. Houde* \nUCSF Keck Center \nBox 0732 \nSan Francisco, CA 94143 \nhoude@phy.ucsf.edu \n\nMichael I. Jordan \nMIT Dept. of Brain and Cognitive Sci. \nE10-034D \nCambridge, MA 02139 \njordan@psyche.mit.edu \n\nAbstract \n\nHuman subjects are known to adapt their motor behavior to a \nshift of the visual field brought about by wearing prism glasses \nover their eyes. We have studied the analog of this effect in speech. \nUsing a device that can feed back transformed speech signals in \nreal time, we exposed subjects to alterations of their own speech \nfeedback. We found that speakers learn to adjust their production \nof a vowel to compensate for feedback alterations that change the \nvowel's perceived phonetic identity; moreover, the effect generalizes \nacross consonant contexts and to different vowels. \n\n1 INTRODUCTION \n\nFor more than a century, it has been known that humans will adapt their reaches \nto altered visual feedback [8]. One of the most studied examples of this adaptation \nis prism adaptation, which is seen when a subject reaches to targets while wearing \nimage-shifting prism glasses [2]. Initially, the subject misses the targets, but he \nsoon learns to compensate and reach accurately. This compensation is retained \nbeyond the time that the glasses are worn: when the glasses are removed, the \nsubject's reaches now overshoot targets in the direction that he compensated. This \nretained compensation is called adaptation, and its generation from exposure to \naltered sensory feedback is called sensorimotor adaptation (SA). 
\nIn the study reported here, we investigated whether SA could be observed in a \nmotor task that is quite different from reaching - speech production. Specifically, \nwe examined whether the control of phonetically relevant speech features would \nrespond adaptively to altered auditory feedback. By itself, this is an important \ntheoretical question because various aspects of speech production have already been \nshown to be sensitive to auditory feedback [5, 1, 4]. Moreover, we were particularly \ninterested in whether speech SA would also exhibit generalization. If so, speech SA \ncould be used to examine the organization of speech motor control. For example, \nsuppose we observed adaptation of [ɛ] in \"get\". We could then examine whether \nwe also see adaptation of [ɛ] in \"peg\". If so, then producing [ɛ] in the two different \nwords must access a common, adapted representation - evidence for a hierarchical \nspeech production system in which word productions are composed from smaller \nunits such as phonemes. We could also examine whether adapting [ɛ] in \"get\" \ncauses adaptation of [æ] in \"gat\". If so, then the production representations of [ɛ] \nand [æ] could not be independent, supporting the idea that vowels are produced by \ncontrolling a common set of features. Such theories about the organization of the \nspeech production system have been postulated in phonology and phonetics, but \nthe empirical evidence supporting these theories has generally been observational \nand hence not entirely conclusive [7, 6]. \n\n*To whom correspondence should be addressed. \n\n2 METHODS \n\nTo study speech SA, we focused on vowel production because the phonetically relevant \nfeatures of vowel sounds are formant frequencies, which are feasible to alter \nin real time. 
(See footnote 1.) \nTo alter the formants of a subject's speech feedback, we built the apparatus shown \nin Figure 1. The subject wears earphones and a microphone and sits in front of \na PC video monitor that presents words to be spoken aloud. The signal from the \nmicrophone is sent to a Digital Signal Processing (DSP) board, which collects a 64 ms \ntime interval from which a magnitude spectrum is calculated. From this spectrum, \nformant frequencies and amplitudes are estimated. To alter the speech, the first \nthree formant frequencies are shifted, and the shifted formants drive a formant \nsynthesizer that creates the output speech sent to the subject's earphones. This \nanalysis-synthesis process was accomplished with only 16 ms of feedback delay. To \nminimize how much the subject directly heard of his own voice via bone conduction, \nthe subject produced only whispered speech, masked with mild noise. \n\n[Figure 1 diagram: the microphone signal goes to the DSP board in the PC, where the magnitude spectrum, formant estimation, and F1, F2, F3 frequency alteration stages produce altered formants; the altered feedback is sent to the earphones. The PC video monitor displays the prompted word, e.g. \"pep\".] \n\nFigure 1: The apparatus used in the study. \n\n1 See [3] for detailed discussion of the methods used in this study. \n\nFor each subject in our experiment, we shifted formants along the path defined \nby the (F1,F2,F3) frequencies of a subject's productions of the vowels [i], [ɪ], [ɛ], [æ], \nand [a] (see footnote 2). Figure 2 shows examples of this shifting process in (F1,F2) space for the \nfeedback transformations that were used in the study. 
To shift formants along the \nsubject's [i]-[a] path, we extend the path at both ends and we number the endpoints \nand vowels to make a path position measure that normalizes the distances between \nvowels. The formants of each speech sound F produced by the subject were then \nre-represented in terms of path projection (the path position of the nearest path point \nP) and path deviation (the distance D to this point P). Feedback transformations \nwere constructed to alter path projections while preserving path deviations. Two \ndifferent transformations were used. The +2.0 transformation added 2.0 to path \nprojections: under this transformation, if the subject produced speech sound F (a sound \nnear [ɛ]), he heard instead sound F+ (a sound near [a]). The subject could compensate \nfor this transformation and hear sound F only by shifting his production of F to \nF- (a sound near [i]). The -2.0 transformation subtracted 2.0 from path projections: \nunder this transformation, if the subject produced F, he heard F-. Thus, in this \ncase, the subject could compensate by shifting production to F+. \n\n[Figure 2 plots: the produced sound F and the shifted sounds F+ and F- along the extended [i]-[a] path in (F1,F2) space. Panels: (a) +2.0 Transformation; (b) -2.0 Transformation.] \n\nFigure 2: Feedback transformations used in the study. \n\nThese feedback transformations were used in an experiment in which a subject was \nvisually prompted to whisper words with a 300 ms target duration. Word promptings \noccurred in groups of ten called epochs. Within each epoch, the first six words came \nfrom a set of training words and the last four came from a set of testing words. \nThe subject heard feedback of his first five word productions in each epoch, while \nmasking noise blocked his hearing for his remaining five word productions in the \nepoch. 
Thus, the subject only heard feedback of his production of the first five \ntraining words and never heard his productions of the testing words. \n\n2 Where possible, we use standard phonetic symbols for vowel sounds: [i] as in \"seat\", \n[ɪ] as in \"hit\", [ɛ] as in \"get\", [æ] as in \"hat\", and [a] as in \"pop\". Where font limitations \nprevent us from using these symbols, we use the alternate notation of [i], [ih], [eh], [ae], \nand [ah], respectively, for the same vowel sounds. \n\nThe experiment lasted 2 hours and consisted of 422 epochs divided over five phases: \n\n1. A 10 minute warmup phase used to acclimate the subject to the experimental setup. \n\n2. A 17 minute baseline phase used to measure formants of the subject's normal vowel productions. \n\n3. A 20 minute ramp phase in which the subject's feedback was increasingly altered up to a maximum value. \n\n4. A 1 hour training phase in which the subject produced words while the feedback was maximally altered. \n\n5. A 17 minute test phase used to measure formants of the subject's post-exposure vowel productions while his feedback was maximally altered. \n\nBy the end of the ramp phase, feedback alteration reached its maximum strength, \nwhich was +2.0 for half the subjects and -2.0 for the other subjects. In addition, \nall subjects were run in a control experiment in which feedback was never altered. \n\nThe two word sets from which prompted words were selected were both sets of \nCVC words. Training words (in which adaptation was induced) were all bilabials \nwith [ɛ] as the vowel (\"pep\", \"peb\", \"bep\", and \"beb\"). 
Testing words (in which \ngeneralization of the training word adaptation was measured) were divided into two \nsubsets, each designed to measure a different type of generalization: (1) context \ngeneralization words, which had the same vowel [ɛ] as the training words but varied \nthe consonant context (\"peg\", \"gep\", and \"teg\"); (2) vowel target generalization \nwords, which had the same consonant context as the training words but varied the \nvowel (\"pip\", \"peep\", \"pap\", and \"pop\"). \n\nEight male MIT students participated in the study. All were native speakers of \nNorth American English and all were naive to the purpose of the study. \n\n3 RESULTS \n\nTo illustrate how we measured compensation and adaptation in the experiments, \nwe first show the results for an individual subject. Figure 3 shows (F1,F2) plots of \nthe responses of subject OB in both the adaptation experiment (in which he was exposed \nto the -2.0 feedback transformation) and the control experiment. In each plot, the \ndotted line is OB's [i]-[a] path. \nFigure 3(a) shows OB's compensation responses, which were measured from his \nproductions of the training words made when he heard feedback of his whispering. \nThe solid arrow labeled \"-2.0 xform\" shows how much his mean vowel formants \nchanged (testing phase - baseline phase) after being exposed to the -2.0 feedback \ntransformation. It shows he shifted his production of [ɛ] to something a bit past \n[æ], which corresponds to a path projection change of slightly more than one vowel \ninterval towards [a]. Thus, since the path projection shift of the transformation was -2.0 \n(2.0 vowel intervals towards [i]), the figure shows that OB compensates for over \nhalf the action of the transformation. The hollow arrow in Figure 3(a) shows how \nOB heard his compensation. 
It shows he heard his actual production shift from [ɛ] \ntowards [a] as a shift from [i] back towards [ɛ]. \nFigure 3(b) shows how much of OB's compensation was retained when he whispered \nthe training words with feedback blocked by noise. This retained compensation is \ncalled adaptation, and it was measured from path projection changes by the same \nmethod used to measure compensation. In the figure, we see OB's adaptation \nresponse (the solid \"-2.0 xform\" arrow) is a path projection shift of slightly less \nthan one vowel interval, so his adaptation is slightly less than half. Thus, the figure \nshows that OB retains an appreciable amount of his compensation in the absence \nof feedback. \nFinally, in both plots of Figure 3, the almost non-existent \"control\" arrows show \nthat OB exhibited almost no formant change in the control experiment - as we \nwould expect since feedback was never altered in this experiment. \n\n[Figure 3 plots: F2 (Hz) versus F1 (Hz), showing OB's [i]-[a] path through [i], [ih], [eh], [ae], and [ah], with the \"control\" and \"-2.0 xform\" arrows. Panels: (a) compensation; (b) adaptation.] \n\nFigure 3: Subject OB compensation and adaptation. \n\nThe plots in Figure 4 show that there was significant compensation and adaptation \nacross all subjects. 
In these plots, the vertical scale indicates how much the changes \nin mean vowel formants (testing phase - baseline phase) in each subject's productions \nof the training words compensated for the action of the feedback transformation \nhe was exposed to. The filled circles linked by the solid line show compensation \n(Figure 4(a)) and retained compensation, or adaptation (Figure 4(b)), across subjects \nin the adaptation experiment in which feedback was altered; the open circles \nlinked by the dotted line show the same measures from the control experiment in \nwhich feedback was not altered. (The solid and dotted lines facilitate comparison of \nresults across subjects but do not signify any relationship between subjects.) In the \ncontrol experiment, for each subject, compensation and adaptation were measured \nwith respect to the feedback transformation used in the adaptation experiment. \nThe plots show that there are large variations in compensation and adaptation \nacross subjects, but overall there was significantly more compensation (p < 0.006) \nand adaptation (p < 0.023) in the adaptation experiments than in the control experiments. \nFigure 5 shows plots of how much of the adaptation observed in the training words \ncarried over to the testing words. For each testing word shown, a measure of this \ncarryover called mean generalization is plotted, which was calculated as a ratio of \nadaptations: the adaptation seen in the testing word divided by the adaptation seen \nin the training words (adaptation values observed in the control experiment were 
subtracted out to remove any effects not arising from exposure to altered feedback). \n\n[Figure 4 plots: per-subject values for the adaptation experiment (filled circles, solid line) and the control experiment (open circles, dotted line). Panels: (a) mean compensation; (b) mean adaptation.] \n\nFigure 4: Mean compensation and adaptation across all subjects. \n\nFigure 5(a) shows mean generalization for the context generalization words except \nfor \"pep\" (since \"pep\" was also a training word). The plot shows large variance \nin mean generalization for each of the three words, but overall there was significant \n(p < 0.040) mean generalization. Thus, there was significant carryover of the \nadaptation of [ɛ] in the training words to different consonant contexts. \nFigure 5(b) shows mean generalization for the vowel target generalization words. \nNot all of these words are shown: unfortunately, we were not able to accurately \nestimate the formants of [i] and [a], so \"peep\" and \"pop\" were dropped from our \ngeneralization analysis. For the remaining two vowel target generalization words, \nthe plot shows large variance in mean generalization for each of the words, but \noverall there was significant (p < 0.013) mean generalization. Thus, there was \nsignificant generalization of the adaptation of [ɛ] to the vowels [ɪ] and [æ]. \n\n4 DISCUSSION \n\nSeveral conclusions can be drawn from the experiment described above. First, comparison \nof the adaptation and control experiment results seen in Figure 4 shows \na clear effect of exposure to the altered feedback: this exposure caused compensation \nresponses in most subjects. Furthermore, the adaptation results show that \nthis compensation was retained in the absence of acoustic feedback. 
Next, the context \ngeneralization results seen in Figure 5(a) show that some adapted representation of \n[ɛ] is shared across the training and testing words. These results provide evidence \nfor a hierarchical speech production system in which words are composed from \nsmaller phoneme-like units. Finally, the vowel target generalization results seen in \nFigure 5(b) show that the production representations of [ɪ], [ɛ], and [æ] are not \nindependent, suggesting that these vowels are produced by controlling a common \nset of features. \n\n[Figure 5 plots: mean generalization for each testing word. Panels: (a) context gen. words (\"peg\", \"gep\", \"teg\"); (b) vowel target gen. words (\"pap\", \"pip\").] \n\nFigure 5: Mean generalization for the testing words, averaged across subjects. \n\nThus, in summary, our study has shown (1) that speech production, like reaching, \ncan be made to exhibit sensorimotor adaptation, and (2) that this adaptation effect \nexhibits generalization that can be used to make inferences about the structure of \nthe speech production system. \n\nAcknowledgments \n\nWe thank J. Perkell, K. Stevens, R. Held and P. Sabes for helpful discussions. \n\nReferences \n\n[1] V. L. Gracco et al. (1994), J. Acoust. Soc. Am. 95:2821. \n[2] H. V. Helmholtz (1867), Treatise on physiological optics, Vol. 3 (Eng. Trans. by Optical Soc. of America, Rochester, NY, 1925). \n[3] J. F. Houde (1997), Sensorimotor Adaptation in Speech Production, Doctoral Dissertation, M. I. T., Cambridge, MA. \n[4] H. Kawahara (1993), J. Acoust. Soc. Am. 94:1883. \n[5] B. S. Lee (1950), J. Acoust. Soc. Am. 22:639. \n[6] W. J. M. 
Levelt (1989), Speaking: from intention to articulation, MIT Press, Cambridge, MA. \n[7] A. S. Meyer (1992), Cognition 42:181. \n[8] R. B. Welch (1986), Handbook of Perception and Human Performance, K. R. Boff, L. Kaufman, J. P. Thomas Eds., John Wiley and Sons, New York. \n", "award": [], "sourceid": 1449, "authors": [{"given_name": "John", "family_name": "Houde", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}